US20240037969A1 - Recognition of handwritten text via neural networks - Google Patents
Recognition of handwritten text via neural networks
- Publication number
- US20240037969A1 (application US 18/484,110)
- Authority
- US
- United States
- Prior art keywords
- image
- images
- fragment
- fragment images
- fragmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/153—Segmentation of character regions using recognition of characters or words
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/22—Character recognition characterised by the type of writing
- G06V30/224—Character recognition characterised by the type of writing of printed characters having additional code marks or containing code marks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/413—Classification of content, e.g. text, photographs or tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Definitions
- the present disclosure is generally related to computer systems, and is more specifically related to systems and methods for recognition of handwritten text via neural networks.
- Embodiments of the present disclosure describe a system and method to recognize handwritten text (including hieroglyphic symbols) using deep neural network models.
- a system receives an image depicting a line of text. The system segments the image into two or more fragment images. For each of the two or more fragment images, the system determines a first hypothesis to segment the fragment image into a first plurality of grapheme images and a first fragmentation confidence score. The system determines a second hypothesis to segment the fragment image into a second plurality of grapheme images and a second fragmentation confidence score. The system determines that the first fragmentation confidence score is greater than the second fragmentation confidence score. The system translates the first plurality of grapheme images defined by the first hypothesis to symbols. The system assembles the symbols of each fragment image to derive the line of text.
- FIGS. 1 A- 1 B depict high level system diagrams of an example text recognition system in accordance with one or more aspects of the present disclosure.
- FIG. 2 illustrates an example of an image, fragment images, and grapheme images in accordance with one or more aspects of the present disclosure.
- FIG. 3 illustrates a block diagram of a grapheme recognizer module in accordance with one or more aspects of the present disclosure.
- FIG. 4 illustrates a block diagram of a first level neural network in accordance with one or more aspects of the present disclosure.
- FIG. 5 illustrates a block diagram of a second level neural network in accordance with one or more aspects of the present disclosure.
- FIG. 6 schematically illustrates an example confidence function Q(d) implemented in accordance with one or more aspects of the present disclosure.
- FIG. 7 depicts a flow diagram of a method to recognize a line of text in accordance with one or more aspects of the present disclosure.
- FIG. 8 depicts a flow diagram of a method to recognize a grapheme in accordance with one or more aspects of the present disclosure.
- FIG. 9 depicts a block diagram of an illustrative computer system in accordance with one or more aspects of the present disclosure.
- Optical character recognition may involve matching a given grapheme image with a list of candidate symbols, followed by determining the probability associated with each candidate symbol. The higher the probability of match, the higher the likelihood that the candidate symbol is the correct symbol.
- the text recognition process may extract computer-readable and searchable textual information from indicia-bearing images of various media (such as hand printed paper documents, banners, posters, signs, billboards, and/or other physical objects bearing visible graphemes on one or more of their surfaces).
- “Grapheme” herein shall refer to the elementary unit of a writing system of a given language.
- a grapheme may be represented, e.g., by a logogram representing a word or a morpheme, a syllabic character representing a syllable, or an alphabetic character representing a phoneme.
- “Fragment” herein shall refer to a paragraph, a sentence, a title, a part of a sentence, a word combination, for example, a noun group, etc.
- “Handwritten text” or “handwritten characters” is broadly understood to include any characters, including cursive and print characters, that were produced by hand using any reasonable writing instrument (such as pencil, pen, etc.) on any suitable substrate (such as paper) and further includes characters that were generated by a computer in accordance with user interface input received from a pointing device (such as stylus).
- a string of handwritten characters may include visual gaps between individual characters (graphemes).
- a string of handwritten characters may include one or more conjoined characters with no visual gaps between individual characters (graphemes).
- the text recognition process involves a segmentation stage and a stage of recognizing individual characters. Segmentation involves dividing an image into fragment images and then into grapheme images that contain respective individual characters. Different variants of segmentation or hypotheses may be generated and/or evaluated and the best hypothesis may then be selected based on some predetermined criteria.
- Two problems commonly arise in this process: ambiguity (e.g., the same image in different contexts may correspond to different characters), and incorrect segmentation (e.g., incorrectly segmented images may be used as the input for the individual character recognition).
- These problems of ambiguity and incorrect segmentation may be solved by verification of the character (or grapheme image) at a higher (e.g., image fragment) level.
- a classification confidence score may be provided for each individual character and a recognized character with a classification confidence score below a predetermined threshold may be discarded to improve the character recognition.
- the classification confidence score may be generated by a combination of a structural classifier and a neural network classifier, as further described below.
- the neural network classifier may be trained with positive and negative (invalid/defective) image samples using a loss function that is a combination of center loss, cross entropy loss, and close-to-center penalty loss. Negative image samples may be used as an additional class in the neural network classifier.
- the combined approach described herein represents a significant improvement over various common methods by employing segmentation hypotheses and generating confidence scores for those hypotheses using loss functions.
- the loss functions can specifically be aimed at training the neural network to recognize valid and invalid or defective grapheme images, thus improving the overall quality and efficiency of optical character recognition.
- each confidence score (e.g., selected from the range of 0-1) reflects the level of confidence of the hypothesis that the input grapheme image represents an instance of a certain class of the set of grapheme classes, as described in more detail herein below.
- FIGS. 1 A- 1 B depict high level system diagrams of example text recognition systems in accordance with one or more aspects of the present disclosure.
- the text recognition system 100 may include a text recognition component 110 that may perform optical character recognition for handwritten text.
- Text recognition component 110 may be a client-based application or may be a combination of a client component and a server component.
- text recognition component 110 may execute entirely on the client computing device such as a tablet computer, a smart phone, a notebook computer, a camera, a video camera, or the like.
- a client component of text recognition component 110 executing on a client computing device may receive a document image and transmit it to a server component of text recognition component 110 executing on a server device that performs the text recognition.
- the server component of text recognition component 110 may then return a list of symbols to the client component of text recognition component 110 executing on the client computing device for storage or to provide to another application.
- text recognition component 110 may execute on a server device as an Internet-enabled application accessible via a browser interface.
- the server device may be represented by one or more computer systems such as one or more server machines, workstations, mainframe machines, personal computers (PCs), etc.
- text recognition component 110 may include, but is not limited to, image receiver 101 , image segmenter module 102 , hypotheses module 103 , grapheme recognizer module 104 , confidence score module 105 , spacing and translation module 106 , and correction module 107 .
- One or more of components 101 - 107 , or a combination thereof, may be implemented by one or more software modules running on one or more hardware devices.
- Image receiver 101 may receive an image from various sources such as a camera, a computing device, a server, a handheld device.
- the image may be a document file, or a picture file with visible text.
- the image may be derived from various media (such as hand printed paper documents, banners, posters, signs, billboards, and/or other physical objects with visible text on one or more of their surfaces).
- the image may be pre-processed by applying one or more image transformations to the received image (e.g., binarization, size scaling, cropping, color conversion, etc.) to prepare the image for text recognition.
- Image segmenter module 102 may segment the received image into fragment images and then into grapheme images.
- the fragment images or grapheme images may be segmented by visual spacings or gaps in a line of text in the image.
- Hypotheses module 103 may generate different variants or hypotheses to segment a line of text into fragment image constituents, and a fragment image into one or more grapheme constituents.
- Grapheme recognizer module 104 may perform recognition for a grapheme image using neural network models 162 (as part of data store 160 ).
- Confidence score module 105 may determine a confidence score for a hypothesis based on the recognized grapheme. The confidence score may be a fragmentation confidence score (confidence for a fragment/word) or a classification confidence score (confidence for a symbol/character).
- Spacing and translation module 106 may add spaces (if necessary) to the grapheme and translate the grapheme into character symbols.
- correction module 107 may correct certain parts of the text, taking into account the context of the text. The correction may be performed by verifying the character symbols with dictionary/morphological/syntactic rules 161 .
- modules 101 - 107 are shown separately, some of modules 101 - 107 , or functionalities thereof, may be combined together.
- FIG. 1 B depicts a high level system diagram of interactions of the modules 101 - 107 according to one embodiment.
- image receiver 101 may receive image 109 .
- the outputs of image receiver 101 are input to subsequent modules 102 - 107 to generate symbols 108 .
- the operations of modules 101 - 107 may be performed sequentially or in parallel.
- image segmenter module 102 may process image 109
- grapheme recognizer module 104 may process several graphemes in parallel, e.g., a number of hypotheses for fragments and/or graphemes may be processed in parallel.
- FIG. 2 illustrates an example of an image, fragment images, and grapheme images in accordance with one or more aspects of the present disclosure.
- image 109 may be a binarized image with a line of text “This sample” in the English language.
- segmentation module 102 of FIGS. 1 A- 1 B segments image 109 into hypotheses (or variants) 201 A-C using a linear division graph.
- a linear division graph is a graph that marks all division points: if there are several ways to divide a string into words, and words into letters, all possible division points are marked, and a piece of the image between two marked points is considered a candidate letter (or word).
- hypotheses 201 can be a variation of how many division points there are and/or where to place these division points to divide the line of text into one or more fragment images.
- hypothesis 201 A divides the line of text into two fragment images 203 A-B, e.g., “This” and “sample”. The segmentation may be performed based on visible vertical spacings/gaps in the line of text.
- the segmentation module 102 may further segment a fragment image 203 B into hypotheses (or variants) 205 A-C, where hypothesis 205 A is one variation that divides fragment image 203 B into a plurality of grapheme images 207 .
- the segmentation points may be determined based on gaps in the fragment image. In some cases, the gaps are vertical gaps or slanted gaps (slanted at a particular angle of the handwriting) in the fragment image. In another embodiment, the division points may be determined based on areas of low pixel distributions of a vertical (or slanted) pixel profile of the fragment image. In some other embodiments machine learning may be used for segmentation.
- each hypothesis for the division of graphemes may be evaluated for a fragmentation confidence score, and the division defined by the hypothesis with the highest fragmentation confidence score is picked as the final division of the fragment.
- the constituents of the line of text (e.g., fragments) and the constituents of the fragment images (e.g., graphemes) may be stored in a linear division graph data structure. Different paths of the linear division graph data structure may then be used to enumerate one or more hypotheses (or variations or combinations) of fragments/graphemes based on the segmentation/division points.
- hypotheses module 103 may generate three hypotheses 201 A-C for the fragment divisions.
- For fragment 203 B, e.g., “sample”, hypotheses module 103 generates hypotheses 205 A-C, e.g., “s-a-m-p-l-e”, “s-a-mp-l-e”, and “s-a-m-p-le”, where ‘-’ represents the identified visual gaps or potential dividing gaps between the one or more graphemes.
- hypotheses 205 B and 205 C represent variants where some symbols are “glued” together, i.e., a grapheme image with two or more symbols with no perceivable visual gaps in between. Note that the glued grapheme images would be discarded by grapheme recognizer module 104 as described further below.
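- As an illustration only (not part of the disclosed embodiments), enumerating grapheme-division hypotheses from candidate division points could be sketched in Python as follows; the function and variable names are hypothetical:

```python
from itertools import combinations

def enumerate_hypotheses(fragment_width, candidate_points, max_hypotheses=None):
    """Enumerate ways to split a fragment image into grapheme images.

    candidate_points are x-coordinates of potential division points (e.g.,
    visual gaps found in the fragment image).  Each subset of the candidate
    points yields one segmentation hypothesis, expressed as a list of
    (start, end) pixel ranges for the grapheme images.
    """
    hypotheses = []
    for r in range(len(candidate_points), -1, -1):   # finer divisions first
        for subset in combinations(candidate_points, r):
            cuts = [0, *subset, fragment_width]
            hypotheses.append(list(zip(cuts, cuts[1:])))
            if max_hypotheses and len(hypotheses) >= max_hypotheses:
                return hypotheses
    return hypotheses

# Example: a fragment 60 pixels wide with candidate gaps at x = 10, 25, 40.
for hypothesis in enumerate_hypotheses(60, [10, 25, 40], max_hypotheses=3):
    print(hypothesis)
```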
- FIG. 3 illustrates a block diagram of a grapheme recognizer module in accordance with one or more aspects of the present disclosure.
- Grapheme recognizer module 104 may translate an input grapheme image 301 to an output symbol.
- grapheme recognizer module 104 includes language type determiner (e.g., first level neural network model(s)) 320 , deep neural network (DNN) classifier (e.g., second level neural network model(s)) 330 , structural classifiers 310 , confidence score combiner 340 , and threshold evaluator 350 .
- One or more of components 320 - 350 , or a combination thereof, may be implemented by one or more software modules running on one or more hardware devices.
- Language type determiner (e.g., first level neural network model(s)) 320 may determine a group of grapheme symbols (e.g., a set of characters for a particular language group or alphabets) to search for the grapheme image 301 .
- the first level neural network model is used to classify the grapheme image 301 into one or more languages within a group of languages.
- the first level neural network model may be a convolutional neural network model trained to classify a grapheme image as one of group of languages.
- the language is specified by a user operating the text recognition system 100 , e.g., language input 303 .
- the language is specified by a previously identified language for a neighboring grapheme and/or fragment.
- Deep neural network (DNN) classifier (e.g., the second level neural network model(s)) 330 may classify the grapheme image as a symbol of the chosen alphabet and associate a classification confidence score with the hypothesis associating the grapheme with the symbol.
- the DNN classifiers 330 may use a second level neural network model that is trained for the particular language for the classification.
- Structural classifiers 310 may classify the grapheme image to a symbol with a classification confidence score based on a rule-based classification system.
- Confidence score combiner 340 may combine the classification confidence scores (e.g., a modified classification confidence score) for the grapheme image based on a classification confidence score for a neural network classification and a classification confidence score for a structural classification.
- the combined classification confidence score is a linear combination (such as a weighted sum) of the classification confidence scores of the two classifications.
- the combined classification confidence score is an aggregate function (such as minimum, maximum, or average) of the classification confidence scores of the two classifications.
- Threshold evaluator 350 may evaluate the combined classification confidence score for a particular grapheme. If the combined classification confidence score is below a predetermined threshold (e.g., for glued graphemes, invalid graphemes, or graphemes that belong to a different language), the grapheme may be disregarded from further consideration.
- DNN classifiers 330 may include feature extractor 331 that generates a feature vector corresponding to the input grapheme image 301 .
- DNN classifiers 330 may transform the feature vector into a vector of class weights, such that each weight would characterize the probability of the input image 301 to be a grapheme class of a set of classes (e.g., a set of alphabet characters/symbols A, B, C, etc.), where the grapheme class is identified by the index of the vector element within the vector of class weights.
- DNN classifiers 330 may then apply a normalized exponential function to transform the vector of class weights into a vector of probabilities, such that each probability would characterize a hypothesis of the input grapheme image 301 representing an instance of a certain grapheme class of a set of classes, where the grapheme class is identified by the index of the vector element within the vector of probabilities.
- the set of classes may be represented by a set of alphabet characters A, B, C, etc., and thus each probability of the set of probabilities produced by DNN classifiers 330 would characterize a hypothesis of the input image representing the corresponding character of the set of alphabet characters A, B, C, etc.
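- For illustration, the “normalized exponential function” mentioned above is the softmax; a minimal NumPy sketch (not taken from the disclosure) of turning class weights into probabilities:

```python
import numpy as np

def softmax(class_weights):
    # Subtract the maximum weight for numerical stability before exponentiating.
    shifted = class_weights - np.max(class_weights)
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum()

# Hypothetical class weights for grapheme classes "A", "B", "C".
weights = np.array([2.1, 0.3, -1.0])
probabilities = softmax(weights)   # roughly [0.83, 0.14, 0.04]
```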
- DNN classifiers 330 may include distance confidence function 335 , which computes distances, in the image feature space, between class center vectors 333 (class centers for a particular second level neural network model may be stored as part of class center vectors 163 in data store 160 of FIG. 1 A ) and the feature vector of the input grapheme image 301 .
- each classification confidence score (e.g., selected from the range of 0-1) reflects the level of confidence of the hypothesis of the input grapheme image 301 representing an instance of a certain class of the set of classes, where the grapheme class is identified by the index of the vector element within the vector of classification confidence scores.
- the set of classes may correspond to a set of alphabet characters (such as A, B, C, etc.), and thus the confidence function 335 may produce a set of classification confidence scores, such that each classification confidence score would characterize a hypothesis of the input image representing the corresponding character of the set of alphabet characters.
- the classification confidence score computed for each class (e.g., alphabet character) by confidence function 335 may be represented by the distance between the feature vector of the input image 301 and the center of the respective class.
- structural classifiers 310 may classify the grapheme image to a set of symbols and generate a classification confidence score for a corresponding symbol using a rule-based classification system.
- Structural classifiers 310 may include a structural classifier for a corresponding class (symbol) of the set of classes (symbols).
- a structural classifier may analyze the structure of a grapheme image 301 by decomposing the grapheme image 301 into constituent components (e.g., calligraphic elements: lines, arcs, circles, and dots, etc.).
- the constituent components are then compared with predefined constituent components (e.g., calligraphic elements) for a particular one of the set of classes/symbols. If a particular constituent component exists, a combination, e.g., weighted sum, for the constituent component can be used to calculate a classification confidence score for the classification.
- the structural classifier includes a linear classifier. A linear classifier classifies a grapheme based on a linear combination of the weights of its constituent components.
- the structural classifier includes a Bayesian binary classifier, in which each constituent component corresponds to a binary value (i.e., zero or one) that contributes to the classification confidence score.
- a classification confidence score is generated for each class (e.g., alphabet character) of a set of classes.
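- A minimal sketch, assuming a simple weighted-sum formulation, of rule-based structural classification over constituent components; the component names and weights are hypothetical:

```python
def structural_score(detected_components, class_template):
    """Return a 0-1 confidence score for one class (symbol).

    detected_components: calligraphic elements found in the grapheme image,
        e.g., {"vertical_line", "arc", "dot"}.
    class_template: expected elements for the class mapped to their weights,
        e.g., {"vertical_line": 0.6, "arc": 0.4}.
    The score is the weighted sum of expected elements that are present,
    normalized by the total weight of the template.
    """
    total_weight = sum(class_template.values())
    present_weight = sum(weight for component, weight in class_template.items()
                         if component in detected_components)
    return present_weight / total_weight if total_weight else 0.0

# Hypothetical template for the symbol "b": a vertical line plus a circle.
score_b = structural_score({"vertical_line", "circle"},
                           {"vertical_line": 0.5, "circle": 0.5})   # -> 1.0
```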
- confidence score combiner 340 may combine the classification confidence scores for each class of the set of classes 344 for the DNN classifiers and the structural classifier to generate (combined) classification confidence score 342 for the set of classes 344 .
- the combined classification confidence score is a weighted sum of the classification confidence scores of the DNN classifiers and the structural classifier.
- the combined classification confidence score is a min, max, or average of the classification confidence scores of the two classifiers. The class with the highest combined classification confidence score, and the corresponding combined classification confidence score, may be selected as the output of the grapheme recognizer module 104 .
- threshold evaluator 350 determines whether the combined classification confidence score is below a predetermined threshold. In response to determining the combined classification confidence score is below the predetermined threshold, grapheme recognizer module 104 may return an error code (or a nullified classification confidence score) indicating that the input image does not depict a valid grapheme (e.g., glued grapheme and/or a grapheme from a different language may be present in the input image).
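- A minimal sketch of combining the per-class scores of the DNN and structural classifiers (here as a weighted sum) and applying the threshold; the weight and threshold values are assumptions:

```python
def combine_scores(dnn_scores, structural_scores, weight=0.7, threshold=0.3):
    """Combine per-class confidence scores (dicts mapping symbols to 0-1 scores).

    Returns (best_symbol, combined_score), or (None, 0.0) when the best combined
    score falls below the threshold (e.g., a glued or otherwise invalid grapheme),
    analogous to the error code / nullified score described above.
    """
    combined = {symbol: weight * dnn_scores[symbol]
                        + (1 - weight) * structural_scores.get(symbol, 0.0)
                for symbol in dnn_scores}
    best_symbol = max(combined, key=combined.get)
    best_score = combined[best_symbol]
    if best_score < threshold:
        return None, 0.0
    return best_symbol, best_score
```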
- a first level neural network of language type determiner 320 may be implemented as a convolutional neural network having a structure schematically illustrated by FIG. 4 .
- the example convolutional neural network 400 may include a sequence of layers of different types, such as convolutional layers, pooling layers, rectified linear unit (ReLU) layers, and fully connected layers, each of which may perform a particular operation in recognizing a line of text in an input image. A layer's output may be fed as the input to one or more subsequent layers.
- convolutional neural network 400 may include an input layer 411 , one or more convolutional layers 413 A- 413 B, ReLU layers 415 A- 415 B, pooling layers 417 A- 417 B, and an output layer 419 .
- an input image (e.g., grapheme image 301 of FIG. 3 ) may be received by the input layer 411 and may be subsequently processed by a series of layers of convolutional neural network 400 .
- Each of the convolution layers may perform a convolution operation which may involve processing each pixel of an input fragment image by one or more filters (convolution matrices) and recording the result in a corresponding position of an output array.
- One or more convolution filters may be designed to detect a certain image feature, by processing the input image and yielding a corresponding feature map.
- the output of a convolutional layer may be fed to a ReLU layer (e.g., ReLU layer 415 A), which may apply a non-linear transformation (e.g., an activation function) to process the output of the convolutional layer.
- the output of the ReLU layer 415 A may be fed to the pooling layer 417 A, which may perform a subsampling operation to decrease the resolution and the size of the feature map.
- the output of the pooling layer 417 A may be fed to the convolutional layer 413 B.
- the convolutional neural network 400 may include alternating convolutional layers and pooling layers. These alternating layers may enable creation of multiple feature maps of various sizes. Each of the feature maps may correspond to one of a plurality of input image features, which may be used for performing grapheme recognition.
- the penultimate layer (e.g., the pooling layer 417 B (a fully connected layer)) of the convolutional neural network 400 may produce a feature vector representative of the features of the original image, which may be regarded as a representation of the original image in the multi-dimensional space of image features.
- the feature vector may be fed to the fully-connected output layer 419 , which may generate a vector of class weights, such that each weight would characterize the degree of association of the input image with a grapheme class of a set of classes (e.g., a set of languages).
- the vector of class weights may then be transformed, e.g., by a normalized exponential function, into a vector of probabilities, such that each probability would characterize a hypothesis of the input grapheme image representing an instance of a certain grapheme class of a set of classes (e.g., English language).
- FIG. 4 illustrates a certain number of layers of the convolutional neural network 400
- convolutional neural networks employed in various alternative implementations may include any suitable numbers of convolutional layers, ReLU layers, pooling layers, batch normalization, dropout, and/or any other layers.
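- One possible PyTorch layout of such an alternating convolution/ReLU/pooling network is sketched below for illustration; the layer sizes, input resolution, and number of language groups are assumptions rather than values from the disclosure:

```python
import torch
import torch.nn as nn

class LanguageTypeCNN(nn.Module):
    """Sketch of a first level network: grapheme image -> language group."""
    def __init__(self, num_language_groups=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # convolutional layer 413A
            nn.ReLU(),                                    # ReLU layer 415A
            nn.MaxPool2d(2),                              # pooling layer 417A
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # convolutional layer 413B
            nn.ReLU(),                                    # ReLU layer 415B
            nn.MaxPool2d(2),                              # pooling layer 417B
        )
        self.output = nn.Linear(32 * 8 * 8, num_language_groups)  # output layer 419

    def forward(self, x):                        # x: (batch, 1, 32, 32) binarized image
        features = self.features(x).flatten(1)   # feature vector
        return self.output(features)             # vector of class weights

logits = LanguageTypeCNN()(torch.randn(4, 1, 32, 32))
probabilities = torch.softmax(logits, dim=1)     # per-language-group probabilities
```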
- the convolutional neural network 400 may be trained by forward and backward propagation using images from a training dataset, which includes images of graphemes and respective class identifiers (e.g., characters of particular languages) reflecting the correct classification of the images.
- FIG. 5 illustrates a block diagram of the second level neural network in accordance with one or more aspects of the present disclosure.
- the second level neural network (as part of DNN classifiers 330 of FIG. 3 ) may be implemented as a modified convolutional neural network 500 .
- the example modified convolutional neural network 500 may be similar to convolutional neural network 400 and may include a sequence of layers of different types, such as convolutional layers, pooling layers, rectified linear unit (ReLU) layers, and fully connected layers, each of which may perform a particular operation in recognizing the text in an input image.
- the penultimate (or fully connected) pooling layer 417 B may be viewed as performing the functions of a feature extractor, e.g., feature extractor 331 of FIG. 3 .
- network 500 includes a concatenation layer 518 which concatenates geometric features input 510 of the input image 301 with the output features of pooling layer 417 B.
- the geometric features of the input image 301 may include geometric features of the grapheme in the grapheme image.
- Such geometric features may include: a ratio of the width of the grapheme image to the height of the grapheme image, a ratio of the base line (bottom) of a symbol in the grapheme image to the height of the grapheme image, and a ratio of the height (top) of a symbol in the grapheme image to the height of the grapheme image.
- the geometric features may be used to distinguish between commas and apostrophes.
- Output layer 520 may correspond to a vector of class weights, where each weight would characterize the degree of association of the input image with a grapheme class of a set of classes (e.g., a set of alphabet characters A, B, C, etc.).
- the vector of class weights may then be transformed, e.g., by a normalized exponential function, into a vector of probabilities, such that each probability would characterize a hypothesis of the input grapheme image representing an instance of a certain grapheme class of a set of classes.
- the vectors of class weights and/or probabilities produced by fully-connected output layer 520 may be used in network training or inferencing.
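- A sketch, under the same illustrative assumptions as above, of a second level network whose concatenation layer joins the geometric features with the CNN feature vector (the geometric-feature helper is hypothetical):

```python
import torch
import torch.nn as nn

class GraphemeClassifier(nn.Module):
    """Second level network sketch with a concatenation layer (cf. layer 518)."""
    def __init__(self, num_classes, num_geometric=3, feature_dim=32 * 8 * 8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.output = nn.Linear(feature_dim + num_geometric, num_classes)  # layer 520

    def forward(self, image, geometric):
        # image: (batch, 1, 32, 32); geometric: (batch, 3) holding, e.g., the
        # width/height, baseline/height, and symbol-height/height ratios.
        features = self.backbone(image).flatten(1)           # feature extractor output
        features = torch.cat([features, geometric], dim=1)   # concatenation layer
        return self.output(features)                         # vector of class weights

def geometric_features(width, height, baseline_y, top_y):
    """Hypothetical helper computing the three ratios mentioned above."""
    return torch.tensor([width / height, baseline_y / height, top_y / height])
```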
- the feature vector produced by the penultimate layer (e.g., the pooling layer 417 B) of the convolutional neural network 500 may be fed to the above-described confidence function 335 , which produces a vector of classification confidence scores, such that each classification confidence score (e.g., selected from the range of 0-1) reflects the level of confidence of the hypothesis of the input grapheme image representing an instance of a certain class of the set of classes.
- the classification confidence score computed for each class of the set of classes by the confidence function may be represented by the distance between the feature vector of the input image and the center of the respective class.
- the convolutional neural network 500 may be trained by forward and backward propagation based on a loss function and images from a training dataset, which includes images of graphemes and respective class identifiers (e.g., characters of alphabets, A, B, C, . . . ) reflecting the correct classification of the images.
- a value of the loss function may be computed (forward propagation) based on the observed output of the convolutional neural network (i.e., the vector of probabilities) and the desired output value specified by the training dataset (e.g., the grapheme which is in fact shown by the input image, or, in other words the correct class identifier).
- the difference between the two values is backward propagated to adjust the weights in the layers of convolutional neural network 500 .
- the loss function may be represented by the Cross Entropy Loss (CEL), which may be expressed as follows:
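- A standard form of the cross entropy loss, assumed here since the expression is not spelled out, is CEL = -\sum_i \log P_{i,\,c_i} , where P_{i,\,c_i} denotes the probability assigned by the classifier to the correct class c_i of the i-th input image.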
- the summing is performed over all input images from the current batch of input images.
- the identified classification error is propagated back to the previous layers of the convolutional neural network, in which the network parameters are adjusted accordingly. This process may be repeated until the value of the loss function would stabilize in the vicinity of a certain value or fall below a predetermined threshold.
- the neural network trained using the CEL loss function would place the instances of the same class along a certain vector in the feature space, thus facilitating efficient segregation of instances of different classes.
- While the CEL loss function may be adequate for distinguishing images of different graphemes, it may not always produce satisfactory results in filtering out invalid grapheme images.
- the Center Loss (CL) function may be employed in addition to the CEL function, thus compacting the representation of each class in the feature space, such that all instances of a given class would be located within a relatively small vicinity of a certain point, which would thus become the class center, while any feature representation of an invalid grapheme image would be located relatively further away (e.g., at a distance exceeding a pre-defined or dynamically configured threshold) from any class center.
- the Center Loss function may be expressed as follows:
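- A commonly used form of the center loss, assumed here and written with the same L1 norm that appears in the CCPL expression below, is CL = \sum_i \left\lVert F_i - C_{j(i)} \right\rVert_1 , where F_i is the feature vector of the i-th input image and C_{j(i)} is the center vector of the class to which that image belongs.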
- the summing is performed over all input images from the current batch of input images.
- the center class vectors C j may be computed as the average of all features of the images which belong to the j-th class. As schematically illustrated by FIG. 3 , the computed center class vectors 333 may be stored in the memory accessible by the grapheme recognizer module 104 .
- DNN classifiers 330 may be trained using a loss function represented by a linear combination of the CEL and CL functions, assuming zeroes as the initial values of the center class vectors.
- the values of the center class vectors may be re-computed after processing each training dataset (i.e., each batch of input images).
- DNN classifiers 330 may initially be trained using the CEL function, and initial values of the center class vectors may be computed after completing the initial training stage.
- the subsequent training may utilize a linear combination of the CEL and CL functions, and the values of the center class vectors may be re-computed after processing each training dataset (i.e., each batch of input images).
- the loss function L may be represented by a linear combination of CEL and CL functions, which may be expressed as follows:
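- The assumed form of this combination is L = CEL + \lambda \cdot CL , with \lambda standing for the weight coefficient described below.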
- λ is a weight coefficient, the value of which may be adjusted to throttle the CL impact on the resulting loss function value, thus avoiding over-narrowing the feature range for instances of a given class.
- the confidence function may be designed to ensure that the grapheme recognizer would assign low classification confidence scores to invalid grapheme images. Accordingly, the confidence of associating a given image with a certain class (e.g., recognizing a certain grapheme in the image) would thus reflect the distance between the feature vector of the image and the center of the class, which may be expressed as follows:
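- The assumed form of this relation, in the notation defined below, is confidence_k = Q(d_k) with d_k = \left\lVert F - C_k \right\rVert , where Q is a monotonically decreasing function of the distance in the space of image features.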
- d k is the distance between the center C k of k-th class and the feature vector F of a given image.
- distance confidence function 335 may be represented by a monotonically decreasing function of the distance between the class center and the feature vector of an input image in the space of image features. Thus, the further the feature vector is located from the class center, the less would be the classification confidence score assigned to associating the input image with this class.
- distance confidence function 335 may be provided by a piecewise-linear function of the distance.
- the distance confidence function 335 may be performed by selecting certain classification confidence scores q i and determining the corresponding distance values d i that would minimize the number of classification errors produced by the classifier processing a chosen validation dataset (which may be represented, e.g., by a set of document images (e.g., images of document pages) with associated metadata specifying the correct classification of the graphemes in the image).
- the classification confidence scores q i may be chosen at equal intervals within the valid range of classification confidence scores (e.g., 0-1).
- the intervals between the classification confidence scores q i may be chosen to increase while moving along the classification confidence score range toward the lowest classification confidence score, such that the intervals would be smaller within a certain high classification confidence score range and larger within a certain low classification confidence score range.
- FIG. 6 schematically illustrates an example confidence function Q(d) implemented in accordance with one or more aspects of the present disclosure.
- the classification confidence scores q k may be chosen at pre-selected intervals within the valid range of classification confidence scores (e.g., 0-1), and then the corresponding values d k may be determined. If higher sensitivity of the function to its inputs in the higher range of function values is desired, the q k values within a certain high classification confidence score range may be selected at relatively small intervals (e.g., 1; 0.98; 0.95; 0.9; 0.85; 0.8; 0.7; 0.6; . . . ).
- the distances ⁇ k between neighboring d k values may then be determined by applying optimization methods, such as the differential evolution method.
- the confidence function Q(d) may then be constructed as a piecewise linear function connecting the computed points (d k , q k ).
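- A minimal sketch, assuming NumPy and illustrative anchor points (d_k, q_k), of evaluating such a piecewise linear confidence function; fitting the d_k values on a validation set could be done, for example, with scipy.optimize.differential_evolution, in line with the differential evolution method mentioned above:

```python
import numpy as np

# Hypothetical anchor points (d_k, q_k): larger distance -> lower confidence.
d_points = np.array([0.0, 0.5, 1.0, 1.8, 3.0])
q_points = np.array([1.0, 0.95, 0.8, 0.4, 0.0])

def confidence(distance):
    """Piecewise linear, monotonically decreasing confidence function Q(d)."""
    return float(np.interp(distance, d_points, q_points))

print(confidence(0.7))   # interpolates between (0.5, 0.95) and (1.0, 0.8)
print(confidence(5.0))   # beyond the last anchor point -> 0.0
```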
- the classification confidence scores may only be determined for a subset of the classification hypotheses which the classifier has associated with high probabilities (e.g., exceeding a certain threshold).
- the above-described loss and confidence functions ensure that, for the majority of invalid grapheme images, low classification confidence scores would be assigned to hypotheses associating the input images with all possible graphemes.
- a clear advantage of applying the above-described loss and confidence functions is training the classifier without requiring the presence of negative samples in the training dataset, since, as noted herein above, all possible variations of invalid images may be difficult to produce, and the number of such variations may significantly exceed the number of valid graphemes.
- the DNN classifiers 330 trained using the above-described loss and confidence functions may still fail to filter out a small number of invalid grapheme images. For example, a hypothesis associating an invalid grapheme image with a certain class (i.e., erroneously recognizing a certain grapheme within the image) would receive a high classification confidence score if the feature vector of the invalid grapheme image is located within a relatively small vicinity of a center of the class. While the number of such errors tends to be relatively small, the above-described loss function may be enhanced in order to filter out such invalid grapheme images.
- the above-described loss function represented by a linear combination of the CEL function and the CL function may be enhanced by introducing a third term, referred to herein as the Close-to-Center Penalty Loss (CCPL) function, which would cause the feature vectors of known types of invalid images to be removed from the centers of all classes.
- Training the second level neural network using the loss function with the CCPL may involve iteratively processing batches of input images, such that each batch includes positive samples (images of valid graphemes) and negative samples (invalid grapheme images).
- the CEL + λ·CL term may be computed only for positive samples, while the μ·CCPL term may be computed only for negative samples.
- the training dataset may include the negative samples represented by real invalid grapheme images which were erroneously classified as valid images and assigned classification confidence scores exceeding a certain pre-determined threshold.
- the training dataset may include the negative samples represented by synthetic invalid grapheme images.
- the CCPL function which is computed for negative training samples, may be expressed as follows:
- CCPL = \sum_i \sum_j \max\left(0,\; A - \left\lVert F_j^{neg} - C_i \right\rVert_1\right)
- A is a pre-defined or adjustable parameter defining the size of the neighborhood of the class center (i.e., the distance to the class center) in the space of image features, such that the feature vectors located within the neighborhood are penalized, while the penalty would not be applied to the feature vectors located outside of the neighborhood.
- for each feature vector of a negative sample that falls within the distance A of a class center, the value of the CCPL function is incremented by the amount by which that distance falls short of A.
- Training the DNN classifiers involves minimizing the CCPL value. Accordingly, the trained DNN classifiers would, for an invalid grapheme image, yield a feature vector which is located outside of immediate vicinities of the centers of valid classes. In other words, the classifier is trained to distinguish between valid graphemes and invalid grapheme images.
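- A PyTorch-style sketch of the combined training loss (CEL plus weighted CL on positive samples and weighted CCPL on negative samples); the weight coefficients, the use of the L1 distance, and the batch averaging are assumptions:

```python
import torch
import torch.nn.functional as F

def combined_loss(features, logits, labels, centers, is_negative,
                  lam=0.1, mu=0.1, A=2.0):
    """features: (batch, dim) penultimate-layer feature vectors
    logits:      (batch, num_classes) class weights from the output layer
    labels:      (batch,) correct class indices (ignored for negative samples)
    centers:     (num_classes, dim) class center vectors
    is_negative: (batch,) boolean mask marking invalid-grapheme samples
    """
    positive, negative = ~is_negative, is_negative

    # Cross Entropy Loss and Center Loss, computed on positive samples only.
    cel = F.cross_entropy(logits[positive], labels[positive])
    cl = (features[positive] - centers[labels[positive]]).abs().sum(dim=1).mean()

    # Close-to-Center Penalty Loss: penalize negative-sample feature vectors
    # that fall within distance A of any class center (batch-averaged here,
    # whereas the expressions above sum over the batch).
    distances = torch.cdist(features[negative], centers, p=1)
    ccpl = torch.clamp(A - distances, min=0).sum(dim=1).mean()

    return cel + lam * cl + mu * ccpl
```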
- a particular second level neural network 500 (as part of DNN classifiers 330 ) is trained for a particular language group to generate a second level neural network model for that particular language (e.g., set of symbols); the corresponding second level neural network model is then selected when that particular language is specified.
- the second level neural network models trained by the methods described herein may be utilized for performing various image classification tasks, including but not limited to the text recognition.
- a particular second level neural network model (as part of DNN classifiers 330 ) may be used to classify a grapheme image to a set of classes and to generate classification confidence scores for the set of classes.
- Combiner 340 may then combine the classification confidence scores with the classification confidence scores of a structural classifier.
- Threshold evaluator 350 may then evaluate (or discard) the best grapheme class using the combined classification confidence score.
- Grapheme recognizer module 104 may iterate through the grapheme images of a hypothesis and generate a (combined) classification confidence score for each grapheme image.
- confidence score module 105 may combine the classification confidence scores of all the grapheme images within a hypothesis to generate a fragmentation confidence score for that particular hypothesis of grapheme division. Among several hypotheses for the grapheme division, the hypothesis with the highest fragmentation confidence score is selected as the final hypothesis (and its score is denoted as the fragmentation confidence score for the fragment). In one embodiment, the fragmentation confidence scores of the individual fragments may be combined to determine a final confidence score for a hypothesis of dividing the image with the line of text into its fragment image constituents.
- spacing and translation module 106 then translates the final hypothesis for the fragment division to output symbols 108 .
- the spacing/punctuation adjustment may be applied to the output symbols 108 where necessary using a rule-based algorithm. For example, a period follows the end of a sentence, double spacing is applied in between sentences, etc.
- correction module 107 may further adjust the symbols 108 taking into account a context for the output symbols.
- a third level deep (convolutional) neural network model may classify the symbols 108 into a number, name, address, email address, date, language, etc. to give context to the symbols 108 .
- the symbols 108 (or words within the symbols) with the context may be compared with a dictionary, a morphological model, a syntactic model, etc.
- an ambiguous word (e.g., a word with a confidence below a certain threshold) may be identified, and the ambiguous word may be corrected in response to a match.
- FIGS. 7 - 8 are flow diagrams of various implementations of methods related to text recognition.
- the methods are performed by processing logic that may include hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination of both.
- the methods and/or each of their individual functions, routines, subroutines, or operations may be performed by one or more processors of a computing device (e.g., computing system 800 of FIG. 9 ) implementing the methods.
- the methods may be performed by a single processing thread.
- the methods may be performed by two or more processing threads, each thread implementing one or more individual functions, routines, subroutines, or operations of the methods.
- Method 700 may be performed by text recognition system 100 of FIG. 1 A
- method 750 may be performed by grapheme recognizer module 104 of FIG. 1 A .
- FIG. 7 depicts a flow diagram of a method to recognize a line of text in accordance with one or more aspects of the present disclosure.
- processing logic receives an image with a line of text.
- processing logic segments the image into two or more fragment images.
- processing logic determines a first hypothesis to segment the fragment image into a first plurality of grapheme images.
- processing logic determines a first fragmentation confidence score for the first hypothesis.
- processing logic determines a second hypothesis to segment the fragment image into a second plurality of grapheme images.
- processing logic determines a second fragmentation confidence score for the second hypothesis.
- processing logic determines that the first fragmentation confidence score is greater than the second fragmentation confidence score.
- processing logic translates the first plurality of grapheme images defined by a first hypothesis to a plurality of symbols.
- processing logic assembles the plurality of symbols of each fragment image to derive the line of text.
- determining the first fragmentation confidence score further comprises: applying a first level neural network model to one of the first plurality of grapheme images to determine a language grouping for the grapheme image.
- processing logic further selects a second level neural network model based on the language grouping, applies the second level neural network model to the grapheme image to determine a classification confidence score for the grapheme image, and determines the first fragmentation confidence score based on classification confidence scores for each of the first plurality of grapheme images.
- a particular second level neural network model corresponds to a particular grouping of graphemes to be recognized, wherein the particular language grouping corresponds to a group of symbols of a particular language.
- processing logic further applies a structural classifier to the grapheme image, determines a modified classification confidence score for the grapheme image based on the structural classification, and determines the first fragmentation confidence score based on modified classification confidence scores for each of the first plurality of grapheme images. In one embodiment, processing logic further determines whether the first fragmentation confidence score is greater than a predetermined threshold, and responsive to determining the first fragmentation confidence score is greater than the predetermined threshold, processing logic translates the first plurality of grapheme images of the first hypothesis to a plurality of symbols.
- a second level neural network model is trained using a loss function, wherein the loss function is represented by a combination of a cross entropy loss function, a center loss function, and/or a close-to-center penalty loss function.
- the second level neural network model includes a first input for a grapheme image, a second input for geometric features of the grapheme image, and a concatenate layer to concatenate the geometric features of the grapheme image to an inner layer of the second level neural network model.
- processing logic further verifies the plurality of symbols with a morphological, a dictionary, or a syntactical model.
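- For illustration, the overall flow of method 700 might be sketched as follows; the helper objects and their method names are hypothetical:

```python
def recognize_line(image, segmenter, hypothesizer, grapheme_recognizer):
    """Split a line image into fragments, score competing grapheme-segmentation
    hypotheses for each fragment, keep the best one, and assemble the symbols."""
    words = []
    for fragment in segmenter.split_into_fragments(image):
        best_score, best_symbols = -1.0, []
        for hypothesis in hypothesizer.grapheme_hypotheses(fragment):
            symbols, scores = [], []
            for grapheme_image in hypothesis:
                symbol, score = grapheme_recognizer.recognize(grapheme_image)
                symbols.append(symbol)
                scores.append(score)
            # Fragmentation confidence score for this hypothesis, e.g., the
            # mean of the per-grapheme classification confidence scores.
            fragmentation_score = sum(scores) / len(scores) if scores else 0.0
            if fragmentation_score > best_score:
                best_score, best_symbols = fragmentation_score, symbols
        words.append("".join(s for s in best_symbols if s is not None))
    return " ".join(words)
```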
- FIG. 8 depicts a flow diagram of a method to recognize a grapheme in accordance with one or more aspects of the present disclosure.
- processing logic iterates through a plurality of grapheme images of a hypothesis.
- processing logic recognizes a grapheme within a grapheme image and obtains a first classification confidence score for the grapheme image.
- processing logic verifies the recognized grapheme with a structural classifier and obtains a second classification confidence score.
- processing logic obtains a (combined) classification confidence score for the recognized graphemes based on the first and the second classification confidence scores.
- processing logic determines whether the classification confidence score is above a predetermined threshold.
- processing logic outputs the recognized graphemes and classification confidence scores for the plurality of grapheme images.
- recognizing a grapheme further comprises recognizing the grapheme using a second level convolutional neural network model.
- processing logic assembles the recognized graphemes to derive a line of text.
- FIG. 9 depicts an example computer system 800 which may perform any one or more of the methods described herein.
- computer system 800 may correspond to a computing device capable of executing text recognition component 110 of FIG. 1 A .
- the computer system may be connected (e.g., networked) to other computer systems in a LAN, an intranet, an extranet, or the Internet.
- the computer system may operate in the capacity of a server in a client-server network environment.
- the computer system may be a personal computer (PC), a tablet computer, a set-top box (STB), a personal Digital Assistant (PDA), a mobile phone, a camera, a video camera, or any device capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that device.
- the exemplary computer system 800 includes a processing device 802 , a main memory 804 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM)), a static memory 806 (e.g., flash memory, static random access memory (SRAM)), and a data storage device 816 , which communicate with each other via a bus 808 .
- Processing device 802 represents one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. More particularly, the processing device 802 may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets or processors implementing a combination of instruction sets.
- the processing device 802 may also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like.
- the processing device 802 is configured to execute instructions 826 for performing the operations and steps discussed herein.
- the computer system 800 may further include a network interface device 822 .
- the computer system 800 also may include a video display unit 810 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)), an alphanumeric input device 812 (e.g., a keyboard), a cursor control device 814 (e.g., a mouse), and a signal generation device 820 (e.g., a speaker).
- the video display unit 810 , the alphanumeric input device 812 , and the cursor control device 814 may be combined into a single component or device (e.g., an LCD touch screen).
- the data storage device 816 may include a computer-readable medium 824 on which is stored instructions 826 (e.g., corresponding to the methods of FIGS. 7 - 8 , etc.) embodying any one or more of the methodologies or functions described herein. Instructions 826 may also reside, completely or at least partially, within the main memory 804 and/or within the processing device 802 during execution thereof by the computer system 800 , the main memory 804 and the processing device 802 also constituting computer-readable media. Instructions 826 may further be transmitted or received over a network via the network interface device 822 .
- While the computer-readable storage medium 824 is shown in the illustrative examples to be a single medium, the term “computer-readable storage medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions.
- the term “computer-readable storage medium” shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure.
- the term “computer-readable storage medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical media, and magnetic media.
- the present disclosure also relates to an apparatus for performing the operations herein.
- This apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer.
- a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
- a machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer).
- a machine-readable (e.g., computer-readable) medium includes a machine (e.g., a computer) readable storage medium (e.g., read only memory (“ROM”), random access memory (“RAM”), magnetic disk storage media, optical storage media, flash memory devices, etc.).
- any aspect or design described herein as “example” or “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the words “example” or “exemplary” is intended to present concepts in a concrete fashion.
- the term “or” is intended to mean an inclusive “or” rather than an exclusive “or”. That is, unless specified otherwise, or clear from context, “X includes A or B” is intended to mean any of the natural inclusive permutations. That is, if X includes A; X includes B; or X includes both A and B, then “X includes A or B” is satisfied under any of the foregoing instances.
Abstract
In one embodiment, a system receives an image depicting a line of text. The system segments the image into two or more fragment images. For each of the two or more fragment images, the system determines a first hypothesis to segment the fragment image into a first plurality of grapheme images and a first fragmentation confidence score. The system determines a second hypothesis to segment the fragment image into a second plurality of grapheme images and a second fragmentation confidence score. The system determines that the first fragmentation confidence score is greater than the second fragmentation confidence score. The system translates the first plurality of grapheme images defined by the first hypothesis to symbols. The system assembles the symbols of each fragment image to derive the line of text.
Description
- This application is a continuation application of U.S. application Ser. No. 17/107,256, filed Nov. 30, 2020, which claims the benefit of priority under 35 USC § 119 to Russian patent application No. 2020138488, filed Nov. 24, 2020. Both above-referenced applications are incorporated by reference herein.
- The present disclosure is generally related to computer systems, and is more specifically related to systems and methods for recognition of handwritten text via neural networks.
- It may be difficult to formalize an algorithm to recognize handwriting, since the shape, size, and consistency of hand printed characters may vary, even if a person writes in block letters. Furthermore, the same letters printed by different people, and the same letters printed by the same person at different times or under different circumstances, may appear completely different.
- Embodiments of the present disclosure describe a system and method to recognize handwritten text (including hieroglyphic symbols) using deep neural network models. In one embodiment, a system receives an image depicting a line of text. The system segments the image into two or more fragment images. For each of the two or more fragment images, the system determines a first hypothesis to segment the fragment image into a first plurality of grapheme images and a first fragmentation confidence score. The system determines a second hypothesis to segment the fragment image into a second plurality of grapheme images and a second fragmentation confidence score. The system determines that the first fragmentation confidence score is greater than the second fragmentation confidence score. The system translates the first plurality of grapheme images defined by the first hypothesis to symbols. The system assembles the symbols of each fragment image to derive the line of text.
- The present disclosure is illustrated by way of example, and not by way of limitation, and may be more fully understood with reference to the following detailed description when considered in connection with the Figures in which:
-
FIGS. 1A-1B depict high level system diagrams of an example text recognition system in accordance with one or more aspects of the present disclosure. -
FIG. 2 illustrates an example of an image, fragment images, and grapheme images in accordance with one or more aspects of the present disclosure. -
FIG. 3 illustrates a block diagram of a grapheme recognizer module in accordance with one or more aspects of the present disclosure. -
FIG. 4 illustrates a block diagram of a first level neural network in accordance with one or more aspects of the present disclosure. -
FIG. 5 illustrates a block diagram of a second level neural network in accordance with one or more aspects of the present disclosure. -
FIG. 6 schematically illustrates an example confidence function Q(d) implemented in accordance with one or more aspects of the present disclosure. -
FIG. 7 depicts a flow diagram of a method to recognize a line of text in accordance with one or more aspects of the present disclosure. -
FIG. 8 depicts a flow diagram of a method to recognize a grapheme in accordance with one or more aspects of the present disclosure. -
FIG. 9 depicts a block diagram of an illustrative computer system in accordance with one or more aspects of the present disclosure. - Optical character recognition may involve matching a given grapheme image with a list of candidate symbols, followed by determining the probability associated with each candidate symbol. The higher the probability of match, the higher the likelihood that the candidate symbol is the correct symbol. Two problems, however, arise in single character recognition: ambiguity in the individual character and incorrect image segmentation. The first problem arises because the same image in a different context may correspond to different characters, e.g., characters for different languages, number, name, date, email, etc. The second problem arises if the grapheme image does not contain a valid symbol, i.e., the grapheme image does not contain an entire symbol or the grapheme image represents two symbols glued together.
- Described herein are methods and systems for handwritten text recognition. The text recognition process may extract computer-readable and searchable textual information from indicia-bearing images of various media (such as hand printed paper documents, banners, posters, signs, billboards, and/or other physical objects bearing visible graphemes on one or more of their surfaces). “Grapheme” herein shall refer to the elementary unit of a writing system of a given language. A grapheme may be represented, e.g., by a logogram representing a word or a morpheme, a syllabic character representing a syllable, or an alphabetic character representing a phoneme. “Fragment” herein shall refer to a paragraph, a sentence, a title, a part of a sentence, a word combination, for example, a noun group, etc. “Handwritten text” or “handwritten characters” is broadly understood to include any characters, including cursive and print characters, that were produced by hand using any reasonable writing instrument (such as pencil, pen, etc.) on any suitable substrate (such as paper) and further includes characters that were generated by a computer in accordance with user interface input received from a pointing device (such as stylus). In an illustrative example, a string of handwritten characters may include visual gaps between individual characters (graphemes). In another illustrative example, a string of handwritten characters may include one or more conjoined characters with no visual gaps between individual characters (graphemes).
- In accordance with aspects of the present disclosure, the text recognition process involves a segmentation stage and a stage of recognizing individual characters. Segmentation involves dividing an image into fragment images and then into grapheme images that contain respective individual characters. Different variants of segmentation or hypotheses may be generated and/or evaluated and the best hypothesis may then be selected based on some predetermined criteria.
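- By way of illustration only, the enumeration of segmentation hypotheses from a set of candidate division points may be sketched in Python as follows. The function and variable names below are hypothetical and are not part of the claimed system; each hypothesis is simply one subset of the candidate division points, i.e., one path through a linear division graph:

```python
from itertools import combinations

def enumerate_hypotheses(image_width, division_points, max_hypotheses=None):
    """Enumerate segmentation hypotheses for a fragment image.

    division_points: candidate x-coordinates of visual gaps in the fragment image.
    Each hypothesis is a list of (start, end) spans, one span per grapheme image.
    """
    hypotheses = []
    points = sorted(division_points)
    for r in range(len(points), -1, -1):          # prefer finer divisions first
        for chosen in combinations(points, r):
            cuts = [0, *chosen, image_width]
            hypotheses.append(list(zip(cuts[:-1], cuts[1:])))
            if max_hypotheses and len(hypotheses) >= max_hypotheses:
                return hypotheses
    return hypotheses

# For a fragment such as "sample" with five candidate gaps, the hypotheses
# correspond to "s-a-m-p-l-e", "s-a-mp-l-e", "s-a-m-p-le", and so on.
spans = enumerate_hypotheses(60, [10, 20, 30, 40, 50], max_hypotheses=3)
print(spans[0])
```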
- For individual character recognition, there exist the problems of ambiguity (e.g., the same image in different contexts may correspond to different characters) and of incorrectly segmented images being used as the input for individual character recognition. The problems of ambiguity and incorrect segmentation may be solved by verification of the character (or grapheme image) at a higher (e.g., image fragment) level. Furthermore, a classification confidence score may be provided for each individual character, and a recognized character with a classification confidence score below a predetermined threshold may be discarded to improve the character recognition. The classification confidence score may be generated by a combination of a structural classifier and a neural network classifier, as further described below. The neural network classifier may be trained with positive and negative (invalid/defective) image samples using a loss function that is a combination of center loss, cross entropy loss, and close-to-center penalty loss. Negative image samples may be used as an additional class in the neural network classifier.
- The combined approach described herein represents a significant improvement over various common methods by employing hypotheses of segmentation and generating confidence scores for the hypotheses using loss functions. The loss functions can specifically be aimed at training the neural network to recognize valid and invalid or defective grapheme images, thus improving the overall quality and efficiency of optical character recognition. Furthermore, the methods of neural network-based optical character recognition using specialized confidence functions described herein represent significant improvements over various common methods by employing a confidence function which computes the distances, in the image feature space, between the feature vector representing the input image and vectors representing the centers of classes of a set of classes, and transforms the computed distances into a vector of confidence scores, such that each confidence score (e.g., selected from the range of 0-1) reflects the level of confidence of the hypothesis of the input grapheme image representing an instance of a certain class of the set of grapheme classes, as described in more detail herein below.
- Various aspects of the above referenced methods and systems are described in detail herein below by way of examples, rather than by way of limitation.
-
FIGS. 1A-1B depict high level system diagrams of example text recognition systems in accordance with one or more aspects of the present disclosure. Thetext recognition system 100 may include atext recognition component 110 that may perform optical character recognition for handwritten text.Text recognition component 110 may be a client-based application or may be a combination of a client component and a server component. In some implementations,text recognition component 110 may execute entirely on the client computing device such as a tablet computer, a smart phone, a notebook computer, a camera, a video camera, or the like. Alternatively, a client component oftext recognition component 110 executing on a client computing device may receive a document image and transmit it to a server component oftext recognition component 110 executing on a server device that performs the text recognition. The server component oftext recognition component 110 may then return a list of symbols to the client component oftext recognition component 110 executing on the client computing device for storage or to provide to another application. In other implementations,text recognition component 110 may execute on a server device as an Internet-enabled application accessible via a browser interface. The server device may be represented by one or more computer systems such as one or more server machines, workstations, mainframe machines, personal computers (PCs), etc. - Referring to
FIG. 1A ,text recognition component 110 may include, but is not limited to,image receiver 101,image segmenter module 102,hypotheses module 103,grapheme recognizer module 104,confidence score module 105, spacing andtranslation module 106, andcorrection module 107. One or more of components 101-107, or a combination thereof, may be implemented by one or more software modules running on one or more hardware devices. -
Image receiver 101 may receive an image from various sources such as a camera, a computing device, a server, a handheld device. The image may be a document file, or a picture file with visible text. The image may be derived from various media (such as hand printed paper documents, banners, posters, signs, billboards, and/or other physical objects with visible texts on one or more of their surfaces. In one embodiment, the image may be pre-processed by applying one or more image transformation to the received image, e.g., binarization, size scaling, cropping, color conversions, etc. to prepare the image for text recognition. -
Image segmenter module 102 may segment the received image into fragment images and then into grapheme images. The fragment images or grapheme images may be segmented by visual spacings or gaps in a line of text in the image.Hypotheses module 103 may generate different variants or hypotheses to segment a line of text into fragment image constituents, and a fragment image into one or more grapheme constituents.Grapheme recognizer module 104 may perform recognition for a grapheme image using neural network models 162 (as part of data store 160).Confidence score module 105 may determine a confidence score for a hypothesis based on the recognized grapheme. The confidence score may be a fragmentation confidence score (confidence for a fragment/word) or a classification confidence score (confidence for a symbol/character). Spacing andtranslation module 106 may add spaces (if necessary) to the grapheme and translate the grapheme into character symbols. After the fragments in the text are recognized,correction module 107 may correct certain parts of the text taken into account a context of the text. The correction may be performed by verifying the character symbols with dictionary/morphological/syntactic rules 161. Although modules 101-107 are shown separately, some of modules 101-107, or functionalities thereof, may be combined together. -
FIG. 1B depicts a high level system diagram of interactions of the modules 101-107 according to one embodiment. For example,image receiver 101 may receiveimage 109. The outputs ofimage receiver 101 are input to subsequent modules 102-107 to generatesymbols 108. Note that, modules 101-107 may be processed sequentially or in parallel. For example, one instance ofimage segmenter module 102 may processimage 109, while several instances ofgrapheme recognizer module 104 may process several graphemes in parallel, e.g., a number of hypotheses for fragments and/or graphemes may be processed in parallel. -
FIG. 2 illustrates an example of an image, fragment images, and grapheme images in accordance with one or more aspects of the present disclosure. Referring toFIGS. 1-2 ,image 109 may be a binarized image with a line of text “This sample” in the English language. In one embodiment,segmentation module 102 ofFIGS. 1A- 1 B segments image 109 into hypotheses (or variants) 201A-C using a linear division graph. A linear division graph is a graph with markings of all division points. I.e., if there are several ways to divide a string into words, and words into letters, all possible division points are marked and a piece of the image between two marked points is considered a candidate letter (or word). Each ofhypotheses 201 can be a variation of how many division points there are and/or where to place these division points to divide the line of text into one or more fragment images. Here,hypothesis 201A divides the line of text into twofragment images 203A-B, e.g., “This” and “sample”. The segmentation may be performed based on visible vertical spacings/gaps in the line of text. - The
segmentation module 102 may further segment afragment images 203B into hypotheses (or variants) 205A-C, wherehypothesis 205A is a single variation to dividefragment image 203B into a plurality ofgrapheme images 207. Here, the segmentation points (or division points) may be determined based on gaps in the fragment image. In some cases, the gaps are vertical gaps or slanted gaps (slanted at a particular angle of the handwriting) in the fragment image. In another embodiment, the division points may be determined based on areas of low pixel distributions of a vertical (or slanted) pixel profile of the fragment image. In some other embodiments machine learning may be used for segmentation. - In another embodiment, a hypothesis for division of graphemes may be evaluated for a fragmentation confidence score and the fragment that defined the hypothesis with the highest fragmentation confidence score is picked as the final fragment. In one embodiment, constituents of the line of text (e.g., fragments) or constituents the fragment images (e.g., graphemes) may be stored in a linear division graph data structure. Different paths of the linear division graph data structure may then be used to enumerate one or more hypotheses (or variations or combinations) of fragments/graphemes based on the segmentation/division points.
- Referring to
FIG. 2 , in a scenario,hypotheses module 103 may generate threehypotheses 201A-C for the fragment divisions. Forfragment 203B, e.g., “sample”,hypotheses module 103 generatehypotheses 205A-C, e.g., “s-a-m-p-l-e”, “s-a-mp-l-e”, and “s-a-m-p-le”, where ‘-’ represents the identified visual gaps or potential dividing gaps between the one or more graphemes. As depicted inFIG. 2 , in this example,hypotheses grapheme recognition module 104 as described further below. -
FIG. 3 illustrates a block diagram of a grapheme recognizer module in accordance with one or more aspects of the present disclosure.Grapheme recognizer module 104 may translate aninput grapheme image 301 to an output symbol. In one embodiment,grapheme recognizer module 104 includes language type determiner (e.g., first level neural network model(s)) 320, deep neural network (DNN) classifier (e.g., second level neural network model(s)) 330,structural classifiers 310,confidence score combiner 340, andthreshold evaluator 350. One or more of components 320-350, or a combination thereof, may be implemented by one or more software modules running on one or more hardware devices. - Language type determiner (e.g., first level neural network model(s)) 320 may determine a group of grapheme symbols (e.g., a set of characters for a particular language group or alphabets) to search for the
grapheme image 301. In one embodiment, the first level neural network model is used to classify thegrapheme image 301 into one or more languages within a group of languages. The first level neural network model may be a convolutional neural network model trained to classify a grapheme image as one of group of languages. In another embodiment, the language is specified by a user operating thetext recognition system 100, e.g.,language input 303. In another embodiment, the language is specified by a previously identified language for a neighboring grapheme and/or fragment. Deep neural network (DNN) classifier (e.g., the second level neural network model(s)) 330 may classify the grapheme image as a symbol of the chosen alphabet and associate a classification confidence score with the hypothesis associating the grapheme with the symbol. TheDNN classifiers 330 may use a second level neural network model that is trained for the particular language for the classification.Structural classifiers 310 may classify the grapheme image to a symbol with a classification confidence score based on a rule-based classification system.Confidence score combiner 340 may combine the classification confidence scores (e.g., a modified classification confidence score) for the grapheme image based on a classification confidence score for a neural network classification and a classification confidence score for a structural classification. In one embodiment, the combined classification confidence score is a linear combination (such as a weighted sum) of the classification confidence scores of the two classifications. In another embodiment, the combined classification confidence score is an aggregate function (such as minimum, maximum, or average) of the classification confidence scores of the two classifications.Threshold evaluator 350 may evaluate the combine classification confidence score for a particular grapheme. If the combined classification confidence scores are below a predetermined threshold (e.g., glued graphemes, invalid graphemes, or graphemes that belong to a different language, etc.). The grapheme may be disregarded for further consideration. - Referring to
FIG. 3 , in one embodiment,DNN classifiers 330 may includefeature extractor 331 that generates a feature vector corresponding to theinput grapheme image 301.DNN classifiers 330 may transform the feature vector into a vector of class weights, such that each weight would characterize the probability of theinput image 301 to be a grapheme class of a set of classes (e.g., a set of alphabet characters/symbols A, B, C, etc.), where the grapheme class is identified by the index of the vector element within the vector of class weights.DNN classifiers 330 may than apply a normalized exponential function to transform the vector of class weights into a vector of probabilities, such that each probability would characterize a hypothesis of theinput grapheme image 301 representing an instance of a certain grapheme class of a set of classes, where the grapheme class is identified by the index of the vector element within the vector of probabilities. In an illustrative example, the set of classes may be represented by a set of alphabet characters A, B, C, etc., and thus each probability of the set of probabilities produced byDNN classifiers 330 would characterize a hypothesis of the input image representing the corresponding character of the set of alphabet characters A, B, C, etc. - However, as noted herein above, such probabilities may be unreliable, e.g., for glued graphemes, invalid graphemes, graphemes that belong to a different language, etc. To alleviate such unreliability,
DNN classifiers 330 may includedistance confidence function 335 which computes distances, in the image feature space, between class center vectors 333 (class centers for a particular second level neural network model may be stored as part ofclass center vectors 163 indata store 160 ofFIG. 1A ) and the feature vector of theinput image 301, and transforms the computed distances into a vector of classification confidence scores, such that each classification confidence score (e.g., selected from the range of 0-1) reflects the level of confidence of the hypothesis of theinput grapheme image 301 representing an instance of a certain class of the set of classes, where the grapheme class is identified by the index of the vector element within the vector of classification confidence scores. In an illustrative example, the set of classes may correspond to a set of alphabet characters (such as A, B, C, etc.), and thus theconfidence function 335 may produce a set of classification confidence scores, such that each classification confidence score would characterize a hypothesis of the input image representing the corresponding character of the set of alphabet characters. In one embodiment, the classification confidence score computed for each class (e.g., alphabet character) byconfidence function 335 may be represented by the distance between the feature vector of theinput image 301 and the center of the respective class. - Referring to
FIG. 3 ,structural classifiers 310 may classify the grapheme image to a set of symbols and generate a classification confidence score for a corresponding symbol using a ruled-based classification system.Structural classifiers 310 may include a structural classifier for a corresponding class (symbol) of the set of classes (symbols). A structural classifier may analyze the structure of agrapheme image 301 by decomposing thegrapheme image 301 into constituent components (e.g., calligraphic elements: lines, arcs, circles, and dots, etc.). The constituent components (or calligraphic elements) are then compared with predefined constituent components (e.g., calligraphic elements) for a particular one of the set of classes/symbols. If a particular constituent component exists, a combination, e.g., weighted sum, for the constituent component can be used to calculate a classification confidence score for the classification. In one embodiment, the structural classifier includes a linear classifier. A linear classifier classifies a grapheme based on a linear combination of the weights of its constituent components. In another embodiment, the structural classifier includes a bayesian binary classifier. Here, each constituent component correspond to a binary value (i.e. zero or one) that contributes to the classification confidence score. Based on the structural classification, a classification confidence score is generated for each class (e.g., alphabet character) of a set of classes. - Next,
confidence score combiner 340 may combine the classification confidence scores for each class of the set ofclasses 344 for the DNN classifiers and the structural classifier to generate (combined)classification confidence score 342 for the set ofclasses 344. In one embodiment, the combined classification confidence score is a weighted sum of the classification confidence scores of the DNN classifiers and the structural classifier. In another embodiment, the combined classification confidence score is a min, max, or average of the classification confidence scores of the two classifiers. The class with a highest combined classification confidence score and the corresponding combined classification confidence score may be selected as output of thegrapheme recognizer module 104. - In one embodiment,
threshold evaluator 350 determines whether the combined classification confidence score is below a predetermined threshold. In response to determining the combined classification confidence score is below the predetermined threshold,grapheme recognizer module 104 may return an error code (or a nullified classification confidence score) indicating that the input image does not depict a valid grapheme (e.g., glued grapheme and/or a grapheme from a different language may be present in the input image). - In an illustrative example, a first level neural network of
language type determiner 320 may be implemented as a convolutional neural network having a structure schematically illustrated byFIG. 4 . The example convolutionalneural network 400 may include a sequence of layers of different types, such as convolutional layers, pooling layers, rectified linear unit (ReLU) layers, and fully connected layers, each of which may perform a particular operation in recognizing a line of text in an input image. A layer's output may be fed as the input to one or more subsequent layers. As illustrated, convolutionalneural network 400 may include aninput layer 411, one or moreconvolutional layers 413A-413B, ReLU layers 415A-415B, pooling layers 417A-417B, and anoutput layer 419. - In some embodiments, an input image (e.g.,
grapheme image 301 ofFIG. 3 ) may be received by theinput layer 411 and may be subsequently processed by a series of layers of convolutionalneural network 400. Each of the convolution layers may perform a convolution operation which may involve processing each pixel of an input fragment image by one or more filters (convolution matrices) and recording the result in a corresponding position of an output array. One or more convolution filters may be designed to detect a certain image feature, by processing the input image and yielding a corresponding feature map. - The output of a convolutional layer (e.g.,
convolutional layer 413A) may be fed to a ReLU layer (e.g.,ReLU layer 415A), which may apply a non-linear transformation (e.g., an activation function) to process the output of the convolutional layer. The output of theReLU layer 415A may be fed to thepooling layer 417A, which may perform a subsampling operation to decrease the resolution and the size of the feature map. The output of thepooling layer 417A may be fed to theconvolutional layer 413B. - Processing of the original image by the convolutional
neural network 400 may iteratively apply each successive layer until every layer has performed its respective operation. As schematically illustrated byFIG. 4 , the convolutionalneural network 400 may include alternating convolutional layers and pooling layers. These alternating layers may enable creation of multiple feature maps of various sizes. Each of the feature maps may correspond to one of a plurality of input image features, which may be used for performing grapheme recognition. - In some embodiments, the penultimate layer (e.g., the
pooling layer 417B (a fully connected layer)) of the convolutionalneural network 400 may produce a feature vector representative of the features of the original image, which may be regarded as a representation of the original image in the multi-dimensional space of image features. - The feature vector may be fed to the fully-connected
output layer 419, which may generate a vector of class weights, such that each weight would characterize the degree of association of the input image with a grapheme class of a set of classes (e.g., a set of languages). The vector of class weights may then be transformed, e.g., by a normalized exponential function, into a vector of probabilities, such that each probability would characterize a hypothesis of the input grapheme image representing an instance of a certain grapheme class of a set of classes (e.g., English language). - While
FIG. 4 illustrates a certain number of layers of the convolutionalneural network 400, convolutional neural networks employed in various alternative implementations may include any suitable numbers of convolutional layers, ReLU layers, pooling layers, batch normalization, dropout, and/or any other layers. The convolutionalneural network 400 may be trained by forward and backward propagation using images from a training dataset, which includes images of graphemes and respective class identifiers (e.g., characters of particular languages) reflecting the correct classification of the images. -
FIG. 5 illustrates a block diagram of the second level neural network in accordance with one or more aspects of the present disclosure. The second level neural network (as part ofDNN classifiers 330 ofFIG. 3 ) may be implemented as a modified convolutionalneural network 500. The example modified convolutionalneural network 500 may be similar to convolutionalneural network 400 and may include a sequence of layers of different types, such as convolutional layers, pooling layers, rectified linear unit (ReLU) layers, and fully connected layers, each of which may perform a particular operation in recognizing the text in an input image. - Referring to
FIG. 5 , the output of penultimate (or fully connected) poolinglayer 417B may be viewed as performing the functions of a feature extractor, e.g.,feature extractor 331 ofFIG. 3 . In one embodiment,network 500 includes aconcatenation layer 518 which concatenates geometric features input 510 of theinput image 301 with the output features of poolinglayer 417B. In one embodiment, the geometric features of theinput image 301 may include geometric features of the grapheme in the grapheme image. Such geometric features may include: ratio of pixels of a width of the grapheme image to a height of the grapheme image, a ratio of a base line (bottom) of a symbol in the grapheme image to the height of the grapheme image, and a ratio of height (top) of symbol in the grapheme image to the height of the grapheme image. Here, the geometric features may be used to distinguish between commas and apostrophes. The output forconcatenate layer 518 is then fed through one or more fullyconnected layers 519A-B followed byoutput layer 520. -
Output layer 520 may correspond to a vector of class weights, where each weight would characterize the degree of association of the input image with a grapheme class of a set of classes (e.g., a set of alphabet characters A, B, C, etc.). The vector of class weights may then be transformed, e.g., by a normalized exponential function, into a vector of probabilities, such that each probability would characterize a hypothesis of the input grapheme image representing an instance of a certain grapheme class of a set of classes. - In some embodiments, the vectors of class weights and/or probabilities produced by fully-connected
output layer 520 may be used in network training or inferencing. For example, while in operation the feature vector produced by the penultimate layer (e.g., thepooling layer 417B) of the convolutionalneural network 500 may be fed to the above-describedconfidence function 335, which produces a vector of classification confidence scores, such that each classification confidence score (e.g., selected from the range of 0-1) reflects the level of confidence of the hypothesis of the input grapheme image representing an instance of a certain class of the set of classes. In some embodiments, the classification confidence score computed for each class of the set of classes by the confidence function may be represented by the distance between the feature vector of the input image and the center of the respective class. - The convolutional
neural network 500 may be trained by forward and backward propagation based on a loss function and images from a training dataset, which includes images of graphemes and respective class identifiers (e.g., characters of alphabets, A, B, C, . . . ) reflecting the correct classification of the images. For example, a value of the loss function may be computed (forward propagation) based on the observed output of the convolutional neural network (i.e., the vector of probabilities) and the desired output value specified by the training dataset (e.g., the grapheme which is in fact shown by the input image, or, in other words the correct class identifier). The difference between the two values are backward propagated to adjust weights in the layers of convolutionalneural network 500. - In one embodiment, the loss function may be represented by the Cross Entropy Loss (CEL), which may be expressed as follows:
-
CEL = −Σ_i log P_(j_i)
- where i is the number of input image in the batch of input images,
- ji is the correct class identifier (e.g., grapheme identifier) for the i-the input image, and
- Pj
i is the probability produced by the neural network for i-th input image representing the j-th class (i.e., for the correct classification of the i-th input image).
- The summing is performed by all input images from the current batch of input images. The identified classification error is propagated back to the previous layers of the convolutional neural network, in which the network parameters are adjusted accordingly. This process may be repeated until the value of the loss function would stabilize in the vicinity of a certain value or fall below a predetermined threshold. The neural network trained using the CEL loss function would place the instances of the same class along a certain vector in the feature space, thus facilitating efficient segregation of instances of different classes.
- While CEL loss function may be adequate for distinguishing images of different graphemes, it may not always produce satisfactory results in filtering out invalid image graphemes. Accordingly, the Center Loss (CL) function may be employed in addition to the CEL function, thus compacting the representation of each class in the feature space, such that all instances of a given class would be located within a relatively small vicinity of a certain point, which would thus become the class center, while any feature representation of an invalid grapheme image would be located relatively further away (e.g., at a distance exceeding a pre-defined or dynamically configured threshold) from any class center.
- In one embodiment, the Center Loss function may be expressed as follows:
-
CL = Σ_i ∥F_i − C_j∥²
- where i is the number of input image in the batch of input images,
- Fi is the feature vector of the i-th input image,
- j is the correct class identifier (e.g., grapheme identifier) for the i-the input image, and
- Cj is the vector of the center of the j-th class.
- The summing is performed by all input images from the current batch of input images.
- The center class vectors Cj may be computed as the average of all features of the images which belong to the j-th class. As schematically illustrated by
FIG. 3 , the computedcenter class vectors 333 may be stored in the memory accessible by thegrapheme recognizer module 104. - In one embodiment,
DNN classifiers 330 may be trained using a loss function represented by a linear combination of the CEL and CL functions, assuming zeroes as the initial values of the center class vectors. The values of the center class vectors may be re-computed after processing each training dataset (i.e., each batch of input images). - In another embodiment,
DNN classifiers 330 may initially be trained using the CEL function, and initial values of the center class vectors may be computed after completing the initial training stage. The subsequent training may utilize a linear combination of the CEL and CL functions, and the values of the center class vectors may be re-computed after processing each training dataset (i.e., each batch of input images). - Employing a combination of CEL and CL functions for neural network training would produce compact representation of each class in the feature space, such that all instances of a given class would be located within a relatively small vicinity of a certain point, which would thus become the class center, while any feature representation of an invalid grapheme image would be located relatively further away (e.g., at a distance exceeding a pre-defined or dynamically configured threshold) from any class center.
- In one embodiment, the loss function L may be represented by a linear combination of CEL and CL functions, which may be expressed as follows:
-
L = CEL + α*CL
- The confidence function may be designed to ensure that the grapheme recognizer would assign low classification confidence scores to invalid grapheme images. Accordingly, the confidence of associating a given image with a certain class (e.g., recognizing a certain grapheme in the image) would thus reflect the distance between the feature vector of the image and the center of the class, which may be expressed as follows:
-
d_k = ∥F − C_k∥²
- In one embodiment,
distance confidence function 335 may be represented by a monotonically decreasing function of the distance between the class center and the feature vector of an input image in the space of image features. Thus, the further the feature vector is located from the class center, the less would be the classification confidence score assigned to associating the input image with this class. - In one embodiment,
distance confidence function 335 may be provided by a piecewise-linear function of the distance. Thedistance confidence function 335 may be performed by selecting certain classification confidence scores qi and determining the corresponding distance values di that would minimize the number of classification errors produced by the classifier processing a chosen validation dataset (which may be represented, e.g., by a set of document images (e.g., images of document pages) with associated metadata specifying the correct classification of the graphemes in the image). In some embodiments, the classification confidence scores qi may be chosen at equal intervals within the valid range of classification confidence scores (e.g., 0-1). Alternatively, the intervals between the classification confidence scores qi may be chosen to increase while moving along the classification confidence score range towards to lowest classification confidence score, such that the intervals would be lower within a certain high classification confidence score range, while being higher within a certain low classification confidence score range. -
FIG. 6 schematically illustrates an example confidence function Q(d) implemented in accordance with one or more aspects of the present disclosure. As schematically illustrated byFIG. 6 , the classification confidence scores qk may be chosen at pre-selected intervals within the valid range of classification confidence scores (e.g., 0-1), and then the corresponding values dk may be determined. If higher sensitivity of the function to its inputs in the higher range of function values is desired, the qk values within a certain high classification confidence score range may be selected at relatively small intervals (e.g., 1; 0.98; 0.95; 0.9; 0.85; 0.8; 0.7; 0.6; . . . ). The distances Δk between neighboring dk values (e.g., dk=dk-1+Δk) may then be determined by applying optimization methods, such as the differential evolution method. The confidence function Q(d) may then be constructed as a piecewise linear function connecting the computed points (dk, qk). - In some embodiments, the classification confidence scores may only be determined for a subset of the classification hypotheses which the classifier has associated with high probabilities (e.g., exceeding a certain threshold).
- Using the above-described loss and confidence functions ensures that, for the majority of invalid grapheme images, low classification confidence scores would be assigned to hypotheses associating the input images with all possible graphemes. A clear advantage of applying the above-described loss and confidence functions is training the classifier without requiring the presence of negative samples in the training dataset, since, as noted herein above, all possible variations of invalid images may be difficult to produce, and the number of such variations may significantly exceed the number of valid graphemes.
- In some embodiments, the
DNN classifiers 330 trained using the above-described loss and confidence functions may still fail to filter out a small number of invalid grapheme images. For example, a hypothesis associating an invalid grapheme image with a certain class (i.e., erroneously recognizing a certain grapheme within the image) would receive a high classification confidence score if the feature vector of the invalid grapheme image is located within a relatively small vicinity of a center of the class. While the number of such errors tends to be relatively small, the above-described loss function may be enhanced in order to filter out such invalid grapheme images. - In one embodiment, the above-described loss function represented by a linear combination of the CEL function and the CL function may be enhanced by introducing a third term, referred herein as Close-to-Center Penalty Loss (CCPL) function, which would cause the feature vectors of known types of invalid images be removed from the centers of all classes. Accordingly, the loss function may be expressed as follows:
-
L = CEL + α*CL + β*CCPL
- In an illustrative example, the training dataset may include the negative samples represented by real invalid grapheme images which were erroneously classified as valid images and assigned classification confidence scores exceeding a certain pre-determined threshold. In another illustrative example, the training dataset may include the negative samples represented by synthetic invalid grapheme images.
- The CCPL function, which is computed for negative training samples, may be expressed as follows:
-
CCPL = Σ_j Σ_i max(0, A − ∥F_j^neg − C_i∥)
- where F_j^neg is the feature vector for the j-th negative training sample,
- C_i is the center of the i-th class, and
- A is a pre-defined or adjustable parameter defining the size of the neighborhood of the class center (i.e., the distance to the class center) in the space of image features, such that the feature vectors located within the neighborhood are penalized, while the penalty would not be applied to the feature vectors located outside of the neighborhood.
- Therefore, if the feature vector of a negative sample is located within a distance not exceeding the value of parameter A from the center of the i-th class, then the value of the CCPL function is incremented by that distance. Training the DNN classifiers involves minimizing the CCPL value. Accordingly, the trained DNN classifiers would, for an invalid grapheme image, yield a feature vector which is located outside of immediate vicinities of the centers of valid classes. In other words, the classifier is trained to distinguish between valid graphemes and invalid grapheme images.
- Referring to
FIGS. 3 and 5 , in one embodiment, a particular second level neural network 500 (as part of DNN classifiers 330) is trained for a particular language group to generate a second level neural network model for that particular language (e.g., set of symbols), where a corresponding second level neural network model is selected corresponding to the particular trained language when the language is specified. As noted herein above, the second level neural network models trained by the methods described herein may be utilized for performing various image classification tasks, including but not limited to the text recognition. - Referring to
FIGS. 3 , once theDNN classifiers 330 is trained, a particular second level neural network model (as part of DNN classifiers 330) may be used to classify a grapheme image to a set of classes and to generate classification confidence scores for the set of classes.Combiner 340 may then combine the classification confidence scores with classification confidence scores of a structural classified.Evaluator 350 may evaluator (or discard) for a best class of grapheme using the combined classification confidence score.Grapheme recognizer module 104 may iterate through the grapheme images of a hypothesis and generate a (combined) classification confidence score for each grapheme image. - Referring again to
FIGS. 1A-B , in one embodiment, once the (combined) classification confidence scores are determined for each grapheme image within a hypothesis,confidence score module 105 may combine the classification confidence scores of all the grapheme image within a hypothesis to generate a fragmentation confidence score for the particular hypothesis for the graphemes divisions. Among several hypotheses for the grapheme divisions, a hypothesis with the highest fragmentation confidence score is selected as the final hypothesis (denoted as the fragmentation confidence score for the fragment). In one embodiment, the fragmentation confidence scores for each fragment may be combined to determine a final confidence score for a hypothesis, the hypothesis to divide the image with the line of text into its fragment image constituents. Among the several hypotheses for the fragments divisions, a hypothesis with a highest final confidence score is selected as the final hypothesis for the fragments division. In one embodiment, spacing andtranslation module 106 then translates the final hypothesis for the fragments division tooutput symbols 108. In one embodiment, the spacing/punctuation adjustment may be applied to theoutput symbols 108 where necessary using a rule-based algorithm. For example, a period follows the end of a sentence, double spacing is applied in between sentences, etc. - In another embodiment,
correction module 107 may further adjust thesymbols 108 taking into account a context for the output symbols. For example, a third level deep (convolutional) neural network model may classify thesymbols 108 into a number, name, address, email address, date, language, etc. to give context to thesymbols 108. In another embodiment, the symbols 108 (or words within the symbols) with the context may be compared with a dictionary, a morphological model, a syntactic model, etc. For example, an ambiguous word (words a confidence below a certain threshold) may be compared with a dictionary definition of neighboring words for a match and the ambiguous word may be corrected in response to a match. -
FIGS. 7-8 are flow diagrams of various implementations of methods related to text recognition. The methods are performed by processing logic that may include hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine), or a combination of both. The methods and/or each of their individual functions, routines, subroutines, or operations may be performed by one or more processors of a computing device (e.g.,computing system 800 ofFIG. 9 ) implementing the methods. In certain implementations, the methods may be performed by a single processing thread. Alternatively, the methods may be performed by two or more processing threads, each thread implementing one or more individual functions, routines, subroutines, or operations of the methods.Method 700 may be performed bytext recognition system 100 ofFIG. 1A , andmethod 750 may be performed bytext recognizer module 104 ofFIG. 1A . - For simplicity of explanation, the methods are depicted and described as a series of acts. However, acts in accordance with this disclosure may occur in various orders and/or concurrently, and with other acts not presented and described herein. Furthermore, not all illustrated acts may be required to implement the methods in accordance with the disclosed subject matter. In addition, those skilled in the art will understand and appreciate that the methods could alternatively be represented as a series of interrelated states via a state diagram or events.
-
FIG. 7 depicts a flow diagram of a method to recognize a line of text in accordance with one or more aspects of the present disclosure. At block 701 , processing logic receives an image with a line of text. At block 703 , processing logic segments the image into two or more fragment images. At block 705 , for each of the two or more fragment images, processing logic determines a first hypothesis to segment the fragment image into a first plurality of grapheme images. At block 707 , processing logic determines a first fragmentation confidence score for the first hypothesis. At block 709 , processing logic determines a second hypothesis to segment the fragment image into a second plurality of grapheme images. At block 711 , processing logic determines a second fragmentation confidence score for the second hypothesis. At block 713 , processing logic determines that the first fragmentation confidence score is greater than the second fragmentation confidence score. At block 715 , processing logic translates the first plurality of grapheme images defined by the first hypothesis to a plurality of symbols. At block 717 , processing logic assembles the plurality of symbols of each fragment image to derive the line of text.
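- Put together, the flow of method 700 may be summarized by the following high-level Python sketch; the helper callables named here are placeholders for the image segmenter, hypotheses, confidence score, and translation modules described above and are not actual APIs of the disclosed system:

```python
def recognize_line(image, segment_line, enumerate_hypotheses, score_hypothesis, translate):
    """High-level flow of method 700: image with a line of text -> recognized text."""
    recognized_fragments = []
    for fragment_image in segment_line(image):                    # blocks 701-703
        hypotheses = enumerate_hypotheses(fragment_image)         # blocks 705, 709
        scored = [(score_hypothesis(h), h) for h in hypotheses]   # blocks 707, 711
        best_score, best = max(scored, key=lambda sh: sh[0])      # block 713
        recognized_fragments.append(translate(best))              # block 715
    return " ".join(recognized_fragments)                         # block 717
```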
- In one embodiment, a particular second level neural network model corresponds to a particular grouping of graphemes to be recognized, wherein the particular language grouping corresponds to a group of symbols of a particular language.
- In one embodiment, processing logic further applies a structural classifier to the grapheme image, determines a modified classification confidence score for the grapheme image based on the structural classification, and determines the first fragmentation confidence score based on modified classification confidence scores for each of the first plurality of grapheme images. In one embodiment, processing logic further determines whether the first fragmentation confidence score is greater than a predetermined threshold, and responsive to determining the first fragmentation confidence score is greater than the predetermined threshold, processing logic translates the first plurality of grapheme images of the first hypothesis to a plurality of symbols.
- In some embodiments, a second level neural network model is trained using a loss function, wherein the loss function is represented by a combination of a cross entropy loss function, a center loss function, and/or a close-to-center penalty loss function. In one embodiment, the second level neural network model includes a first input for a grapheme image, a second input for geometric features of the grapheme image, and a concatenate layer to concatenate the geometric features of the grapheme image to an inner layer of the second level neural network model. In one embodiment, processing logic further verifies the plurality of symbols with a morphological model, a dictionary model, or a syntactical model.
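The two-input architecture and a combined objective can be sketched as follows, assuming PyTorch; the layer sizes, the number of geometric features (e.g., an aspect ratio), and the loss weighting are illustrative choices rather than values taken from the disclosure:

```python
# Sketch of a second level model with two inputs: the grapheme image and its
# geometric features, concatenated into an inner fully connected layer.
import torch
import torch.nn as nn

class GraphemeClassifier(nn.Module):
    def __init__(self, num_classes: int, num_geometric_features: int = 2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # For an assumed 32x32 grayscale grapheme image the feature map is 32 x 8 x 8.
        self.fc1 = nn.Linear(32 * 8 * 8, 128)
        # Inner layer fed by image features concatenated with geometric features.
        self.fc2 = nn.Linear(128 + num_geometric_features, num_classes)

    def forward(self, image: torch.Tensor, geometry: torch.Tensor) -> torch.Tensor:
        features = torch.relu(self.fc1(self.conv(image).flatten(1)))
        combined = torch.cat([features, geometry], dim=1)  # "concatenate layer"
        return self.fc2(combined)                          # class logits

def center_loss(features: torch.Tensor, labels: torch.Tensor,
                centers: torch.Tensor) -> torch.Tensor:
    """Squared distance between each feature vector and its class center."""
    return ((features - centers[labels]) ** 2).sum(dim=1).mean()

# One possible combined objective (weights illustrative; `features` here would be
# the fc1 activations, and the close-to-center penalty term is not shown):
# loss = nn.functional.cross_entropy(logits, labels) + 0.1 * center_loss(features, labels, centers)
```

One motivation for concatenating geometric features at an inner layer, rather than into the image itself, is that properties such as aspect ratio may otherwise be lost when the grapheme image is rescaled to a fixed input size.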
-
FIG. 8 depicts a flow diagram of a method to recognize a grapheme in accordance with one or more aspects of the present disclosure. At block 751, processing logic iterates through a plurality of grapheme images of a hypothesis. At block 753, processing logic recognizes a grapheme within a grapheme image and obtains a first classification confidence score for the grapheme image. At block 755, processing logic verifies the recognized grapheme with a structural classifier and obtains a second classification confidence score. At block 757, processing logic obtains a (combined) classification confidence score for the recognized grapheme based on the first and the second classification confidence scores. At block 759, processing logic determines whether the classification confidence score is above a predetermined threshold. At block 761, processing logic outputs the recognized graphemes and classification confidence scores for the plurality of grapheme images. - In one embodiment, recognizing a grapheme further comprises recognizing the grapheme using a second level convolutional neural network model. In one embodiment, processing logic assembles the recognized graphemes to derive a line of text.
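A hedged sketch of this per-grapheme loop (blocks 751-761) is shown below; the classifier callables, the geometric-mean combination of the two confidence scores, and the threshold value are assumptions used only to make the flow concrete:

```python
import math
from typing import Callable, List, Tuple

CONFIDENCE_THRESHOLD = 0.5  # assumed value of the predetermined threshold

def recognize_graphemes(
    grapheme_images: List[object],
    cnn_classifier: Callable[[object], Tuple[str, float]],      # returns (grapheme, score)
    structural_classifier: Callable[[object, str], float],      # returns verification score
) -> List[Tuple[str, float, bool]]:
    """Iterate over grapheme images, combine the two confidence scores, and threshold."""
    results = []
    for image in grapheme_images:                                   # block 751
        grapheme, cnn_score = cnn_classifier(image)                 # block 753
        structural_score = structural_classifier(image, grapheme)   # block 755
        combined = math.sqrt(cnn_score * structural_score)          # block 757 (geometric mean)
        accepted = combined > CONFIDENCE_THRESHOLD                  # block 759
        results.append((grapheme, combined, accepted))
    return results                                                  # block 761
```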
-
FIG. 9 depicts an example computer system 800 which may perform any one or more of the methods described herein. In one example, computer system 800 may correspond to a computing device capable of executing text recognition component 110 of FIG. 1A. The computer system may be connected (e.g., networked) to other computer systems in a LAN, an intranet, an extranet, or the Internet. The computer system may operate in the capacity of a server in a client-server network environment. The computer system may be a personal computer (PC), a tablet computer, a set-top box (STB), a Personal Digital Assistant (PDA), a mobile phone, a camera, a video camera, or any device capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that device. Further, while only a single computer system is illustrated, the term “computer” shall also be taken to include any collection of computers that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methods discussed herein. - The
exemplary computer system 800 includes a processing device 802, a main memory 804 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM)), a static memory 806 (e.g., flash memory, static random access memory (SRAM)), and a data storage device 816, which communicate with each other via a bus 808. -
Processing device 802 represents one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. More particularly, the processing device 802 may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets or processors implementing a combination of instruction sets. The processing device 802 may also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. The processing device 802 is configured to execute instructions 826 for performing the operations and steps discussed herein. - The
computer system 800 may further include a network interface device 822. The computer system 800 also may include a video display unit 810 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)), an alphanumeric input device 812 (e.g., a keyboard), a cursor control device 814 (e.g., a mouse), and a signal generation device 820 (e.g., a speaker). In one illustrative example, the video display unit 810, the alphanumeric input device 812, and the cursor control device 814 may be combined into a single component or device (e.g., an LCD touch screen). - The
data storage device 816 may include a computer-readable medium 824 on which is stored instructions 826 (e.g., corresponding to the methods of FIGS. 7-8, etc.) embodying any one or more of the methodologies or functions described herein. Instructions 826 may also reside, completely or at least partially, within the main memory 804 and/or within the processing device 802 during execution thereof by the computer system 800, the main memory 804 and the processing device 802 also constituting computer-readable media. Instructions 826 may further be transmitted or received over a network via the network interface device 822. - While the computer-
readable storage medium 824 is shown in the illustrative examples to be a single medium, the term “computer-readable storage medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term “computer-readable storage medium” shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure. The term “computer-readable storage medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical media, and magnetic media. - Although the operations of the methods herein are shown and described in a particular order, the order of the operations of each method may be altered so that certain operations may be performed in an inverse order or so that certain operations may be performed, at least in part, concurrently with other operations. In certain implementations, instructions or sub-operations of distinct operations may be performed in an intermittent and/or alternating manner.
- It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other implementations will be apparent to those of skill in the art upon reading and understanding the above description. The scope of the disclosure should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
- In the above description, numerous details are set forth. It will be apparent, however, to one skilled in the art, that the aspects of the present disclosure may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present disclosure.
- Some portions of the detailed descriptions above are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
- It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “receiving,” “determining,” “selecting,” “storing,” “analyzing,” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
- The present disclosure also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
- The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear as set forth in the description. In addition, aspects of the present disclosure are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present disclosure as described herein.
- Aspects of the present disclosure may be provided as a computer program product, or software, that may include a machine-readable medium having stored thereon instructions, which may be used to program a computer system (or other electronic devices) to perform a process according to the present disclosure. A machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a machine-readable (e.g., computer-readable) medium includes a machine (e.g., a computer) readable storage medium (e.g., read only memory (“ROM”), random access memory (“RAM”), magnetic disk storage media, optical storage media, flash memory devices, etc.).
- Any aspect or design described herein as “example” or “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the words “example” or “exemplary” is intended to present concepts in a concrete fashion. As used in this application, the term “or” is intended to mean an inclusive “or” rather than an exclusive “or”. That is, unless specified otherwise, or clear from context, “X includes A or B” is intended to mean any of the natural inclusive permutations. That is, if X includes A; X includes B; or X includes both A and B, then “X includes A or B” is satisfied under any of the foregoing instances. In addition, the articles “a” and “an” as used in this application and the appended claims should generally be construed to mean “one or more” unless specified otherwise or clear from context to be directed to a singular form. Moreover, use of the term “an embodiment” or “one embodiment” or “an implementation” or “one implementation” throughout is not intended to mean the same embodiment or implementation unless described as such. Furthermore, the terms “first,” “second,” “third,” “fourth,” etc. as used herein are meant as labels to distinguish among different elements and may not necessarily have an ordinal meaning according to their numerical designation.
Claims (20)
1. A method, comprising:
generating one or more hypotheses, each of the one or more hypotheses segmenting an image of a text into a plurality of fragment images, wherein a fragment image of the plurality of fragment images depicts one or more words of the text;
obtaining one or more fragmentation confidence scores, each fragmentation confidence score obtained for a respective hypothesis of the one or more hypotheses, by:
applying a recognition model to the respective plurality of fragment images to identify (i) a plurality of symbols corresponding to the respective plurality of fragment images, and (ii) a plurality of classification confidence scores associated with the respective plurality of fragment images; and
determining, using the plurality of classification confidence scores, the fragmentation confidence score for the respective hypothesis; and
using the one or more fragmentation confidence scores, selecting the plurality of symbols, identified for a winning hypothesis of the one or more hypotheses, as a recognized text.
2. The method of claim 1 , further comprising:
determining, using a language detection model, a language associated with the image; and
selecting the plurality of symbols corresponding to the respective plurality of fragment images from a corpus of symbols of the determined language.
3. The method of claim 1 , further comprising:
for each of the one or more hypotheses:
applying a structural classification model to the respective plurality of fragment images to identify an additional plurality of classification confidence scores characterizing structural similarity of a respective fragment image to one or more reference images; and
wherein the fragmentation confidence score for the respective hypothesis is further determined using the additional plurality of classification confidence scores.
4. The method of claim 1 , wherein the recognition model is trained using a loss function, wherein the loss function comprises one or more of:
a cross entropy loss function,
a center loss function, or
a close-to-center penalty loss function.
5. The method of claim 1 , wherein an input into the recognition model comprises:
a first input comprising one or more fragment images of the respective plurality of fragment images, and
a second input comprising one or more geometric features of the one or more fragment images.
6. The method of claim 5 , wherein the one or more geometric features comprise at least one aspect ratio for one or more graphemes in the one or more fragment images.
7. The method of claim 1 , further comprising:
validating the plurality of symbols, identified for the winning hypothesis, using one or more of a morphological model, a dictionary model, or a syntactical model.
8. A system, comprising:
a memory;
a processing device, communicatively coupled to the memory, the processing device to:
generate one or more hypotheses, each of the one or more hypotheses segmenting an image of a handwritten text into a plurality of fragment images, wherein a fragment image of the plurality of fragment images depicts one or more words of the text;
obtain one or more fragmentation confidence scores, each fragmentation confidence score obtained for a respective hypothesis of the one or more hypotheses, by:
applying a recognition model to the respective plurality of fragment images to identify (i) a plurality of symbols corresponding to the respective plurality of fragment images, and (ii) a plurality of classification confidence scores associated with the respective plurality of fragment images; and
determining, using the plurality of classification confidence scores, the fragmentation confidence score for the respective hypothesis; and
select, using the one or more fragmentation confidence scores, the plurality of symbols, identified for a winning hypothesis of the one or more hypotheses, as a recognized text.
9. The system of claim 8 , wherein the processing device is further to:
determine, using a language detection model, a language associated with the image; and
select the plurality of symbols corresponding to the respective plurality of fragment images from a corpus of symbols of the determined language.
10. The system of claim 8 , wherein the processing device is further to:
for each of the one or more hypotheses:
apply a structural classification model to the respective plurality of fragment images to identify an additional plurality of classification confidence scores characterizing structural similarity of a respective fragment image to one or more reference images; and
wherein the fragmentation confidence score for the respective hypothesis is further determined using the additional plurality of classification confidence scores.
11. The system of claim 8 , wherein the recognition model is trained using a loss function, wherein the loss function comprises one or more of:
a cross entropy loss function,
a center loss function, or
a close-to-center penalty loss function.
12. The system of claim 8 , wherein an input into the recognition model comprises:
a first input comprising one or more fragment images of the respective plurality of fragment images, and
a second input comprising one or more geometric features of the one or more fragment images.
13. The system of claim 12 , wherein the one or more geometric features comprise at least one aspect ratio for one or more graphemes in the one or more fragment images.
14. The system of claim 8 , wherein the processing device is further to:
validate the plurality of symbols, identified for the winning hypothesis, using one or more of a morphological model, a dictionary model, or a syntactical model.
15. A non-transitory computer-readable storage medium storing instructions that, when executed by a processing device, cause the processing device to:
generate one or more hypotheses, each of the one or more hypotheses segmenting an image of a handwritten text into a plurality of fragment images, wherein a fragment image of the plurality of fragment images depicts one or more words of the text;
obtain one or more fragmentation confidence scores, each fragmentation confidence score obtained for a respective hypothesis of the one or more hypotheses, by:
applying a recognition model to the respective plurality of fragment images to identify (i) a plurality of symbols corresponding to the respective plurality of fragment images, and (ii) a plurality of classification confidence scores associated with the respective plurality of fragment images; and
determining, using the plurality of classification confidence scores, the fragmentation confidence score for the respective hypothesis; and
select, using the one or more fragmentation confidence scores, the plurality of symbols, identified for a winning hypothesis of the one or more hypotheses, as a recognized text.
16. The non-transitory computer-readable storage medium of claim 15 , wherein the instructions are further to cause the processing device to:
determine, using a language detection model, a language associated with the image; and
select the plurality of symbols corresponding to the respective plurality of fragment images from a corpus of symbols of the determined language.
17. The non-transitory computer-readable storage medium of claim 15 , wherein the instructions are further to cause the processing device to:
for each of the one or more hypotheses:
apply a structural classification model to the respective plurality of fragment images to identify an additional plurality of classification confidence scores characterizing structural similarity of a respective fragment image to one or more reference images; and
wherein the fragmentation confidence score for the respective hypothesis is further determined using the additional plurality of classification confidence scores.
18. The non-transitory computer-readable storage medium of claim 15 , wherein an input into the recognition model comprises:
a first input comprising one or more fragment images of the respective plurality of fragment images, and
a second input comprising one or more geometric features of the one or more fragment images.
19. The non-transitory computer-readable storage medium of claim 18 , wherein the one or more geometric features comprise at least one aspect ratio for one or more graphemes in the one or more fragment images.
20. The non-transitory computer-readable storage medium of claim 15 , wherein the instructions are further to cause the processing device to:
validate the plurality of symbols, identified for the winning hypothesis, using one or more of a morphological model, a dictionary model, or a syntactical model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/484,110 US20240037969A1 (en) | 2020-11-24 | 2023-10-10 | Recognition of handwritten text via neural networks |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
RU2020138488A RU2757713C1 (en) | 2020-11-24 | 2020-11-24 | Handwriting recognition using neural networks |
RU2020138488 | 2020-11-24 | ||
US17/107,256 US11790675B2 (en) | 2020-11-24 | 2020-11-30 | Recognition of handwritten text via neural networks |
US18/484,110 US20240037969A1 (en) | 2020-11-24 | 2023-10-10 | Recognition of handwritten text via neural networks |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/107,256 Continuation US11790675B2 (en) | 2020-11-24 | 2020-11-30 | Recognition of handwritten text via neural networks |
Publications (1)
Publication Number | Publication Date |
---|---|
US20240037969A1 true US20240037969A1 (en) | 2024-02-01 |
Family
ID=78286661
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/107,256 Active 2041-06-19 US11790675B2 (en) | 2020-11-24 | 2020-11-30 | Recognition of handwritten text via neural networks |
US18/484,110 Pending US20240037969A1 (en) | 2020-11-24 | 2023-10-10 | Recognition of handwritten text via neural networks |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/107,256 Active 2041-06-19 US11790675B2 (en) | 2020-11-24 | 2020-11-30 | Recognition of handwritten text via neural networks |
Country Status (2)
Country | Link |
---|---|
US (2) | US11790675B2 (en) |
RU (1) | RU2757713C1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220189188A1 (en) * | 2020-12-11 | 2022-06-16 | Ancestry.Com Operations Inc. | Handwriting recognition |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11450069B2 (en) | 2018-11-09 | 2022-09-20 | Citrix Systems, Inc. | Systems and methods for a SaaS lens to view obfuscated content |
US11544415B2 (en) | 2019-12-17 | 2023-01-03 | Citrix Systems, Inc. | Context-aware obfuscation and unobfuscation of sensitive content |
US11539709B2 (en) * | 2019-12-23 | 2022-12-27 | Citrix Systems, Inc. | Restricted access to sensitive content |
US11582266B2 (en) | 2020-02-03 | 2023-02-14 | Citrix Systems, Inc. | Method and system for protecting privacy of users in session recordings |
WO2022041163A1 (en) | 2020-08-29 | 2022-03-03 | Citrix Systems, Inc. | Identity leak prevention |
US20230140570A1 (en) * | 2021-11-03 | 2023-05-04 | International Business Machines Corporation | Scene recognition based natural language translation |
US11941232B2 (en) * | 2022-06-06 | 2024-03-26 | Adobe Inc. | Context-based copy-paste systems |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100449805B1 (en) * | 2001-12-26 | 2004-09-22 | 한국전자통신연구원 | Method for segmenting and recognizing handwritten touching numeral strings |
US7817857B2 (en) * | 2006-05-31 | 2010-10-19 | Microsoft Corporation | Combiner for improving handwriting recognition |
CN105988567B (en) * | 2015-02-12 | 2023-03-28 | 北京三星通信技术研究有限公司 | Handwritten information recognition method and device |
US10169871B2 (en) * | 2016-01-21 | 2019-01-01 | Elekta, Inc. | Systems and methods for segmentation of intra-patient medical images |
US10936862B2 (en) * | 2016-11-14 | 2021-03-02 | Kodak Alaris Inc. | System and method of character recognition using fully convolutional neural networks |
US20190258925A1 (en) * | 2018-02-20 | 2019-08-22 | Adobe Inc. | Performing attribute-aware based tasks via an attention-controlled neural network |
Also Published As
Publication number | Publication date |
---|---|
US20220164589A1 (en) | 2022-05-26 |
US11790675B2 (en) | 2023-10-17 |
RU2757713C1 (en) | 2021-10-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AS | Assignment | Owner name: ABBYY DEVELOPMENT INC., DELAWARE; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: UPSHINSKII, ANDREI; REEL/FRAME: 065173/0769; Effective date: 20201126 |
 | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |