
EP0396593A1 - Character recognition apparatus - Google Patents

Character recognition apparatus

Info

Publication number
EP0396593A1
EP0396593A1 EP89900859A EP89900859A EP0396593A1 EP 0396593 A1 EP0396593 A1 EP 0396593A1 EP 89900859 A EP89900859 A EP 89900859A EP 89900859 A EP89900859 A EP 89900859A EP 0396593 A1 EP0396593 A1 EP 0396593A1
Authority
EP
European Patent Office
Prior art keywords
code
character
primitive
primitives
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP89900859A
Other languages
German (de)
French (fr)
Inventor
Shiu-Chang Loh
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Publication of EP0396593A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/10: Image acquisition
    • G06V10/17: Image acquisition using hand-held instruments
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/32: Digital ink
    • G06V30/36: Matching; Classification
    • G06V30/373: Matching; Classification using a special pattern or subpattern alphabet

Definitions

  • the present invention relates to an apparatus and method for identifying characters.
  • an ideographic character detection apparatus for receiving and identifying handwritten ideographic characters.
  • the apparatus requires that the ideographic character be written on an input device and that the written characters be formed from predetermined fundamental strokes or primitives which are typical strokes used by everyone who writes in the ideographic language.
  • the apparatus examines the primitives forming the entered ideographic character and compares the entered primitives with the contents of a look-up table.
  • the look-up table stores a plurality of variations of each of the predetermined primitives to accommodate variations in users' handwriting. Due to the large number of variations of each primitive stored in the table, the primitives forming the character are usually determined by the device.
  • the table also stores the sets of primitives used to form each of the characters in the ideographic language. If the set of primitives forming the entered character corresponds with one of the sets of primitives in the look-up table, an output code associated with the set of primitives is generated and conveyed to an output device. This allows a hard copy image of the entered ideographic character to be formed.
  • a problem exists in that due to the large number of variations of each primitive stored in the table, the processing speed of the apparatus is greatly reduced making it unsuitable for real-time applications.
  • the number of predetermined fundamental strokes or primitives used in this apparatus has typically been chosen to be five or less or twenty or more.
  • said apparatus comprising: input means for receiving successively each of the primitives forming said character and generating input signals for each of said received primitives; processing means receiving said input signals and identifying each of said primitives received by said input means, said processing means generating a character code representing said character upon identification of said primitives; storage means storing a character code and an associated output code for each of the characters in said set; comparing means comparing said character code generated for said entered character with each of said character codes in said storage means to identify said entered character; and output means in communication with said comparison means and generating a reproduction of said entered character upon the identification thereof by said comparison means.
  • the apparatus further includes differentiation means examining said input signals generated for each of said primitives and performing operations thereon, when said character code is equivalent to a character code associated with a plurality of output codes to identify the output code associated with said character.
  • the apparatus is provided with substitution means for selecting the character code stored in the storage means having the highest probability of being equivalent to the character code generated for the entered character, when the input character code is not equivalent to any of the character codes stored in the storage means. It is also preferred that the output means comprises at least one device chosen from the group comprising a printer, audio synthesizer or video display terminal to allow a reproduction of the received ideographic character to be formed or an audio reproduction of the ideographic character to be produced.
  • the character recognition apparatus is capable of recognizing characters written in all ideographic languages, upper case English language characters, and Russian characters.
  • the predetermined set of fundamental primitives is chosen to comprise 20 unique primitives, the various combinations of which will form substantially all characters in a plurality of different languages, whilst decreasing the occurrence of different characters being formed from the same series of primitives.
  • the use of twenty distinct primitives decreases the occurrence of entered characters being represented by character codes which are equivalent to a character code associated with more than one international output code. This, of course, increases the probability of detecting the correct ideographic character.
  • Figure 1 is a functional block diagram of an apparatus for identifying characters
  • Figure 2 is an illustration of an ideographic character
  • Figure 3 shows illustrations of the fundamental primitives used in the device illustrated in Figure 1;
  • Figures 4a to 4c are illustrations of the method of forming the character shown in Figure 2 from the primitives shown in Figure 3;
  • Figure 5 is a more detailed functional block diagram of the device illustrated in Figure 1;
  • Figure 6 is a detailed functional block diagram of a portion of the device illustrated in Figure 1;
  • Figure 7 is an illustration of a coding method used in the device illustrated in Figure 1;
  • Figures 8a and 8b are illustrations of entered fundamental strokes;
  • Figures 9a and 9b are illustrations of still more ideographic characters;
  • Figure 10 is an illustration of a probability matrix used in the device illustrated in Figure 1;
  • Figure 11 is an illustration of an English character
  • Figure 12 is an illustration of more English characters.
  • the apparatus 10 comprises an input device 12 connected to a data processor 14.
  • the input device 12 receives the handwritten character and converts the character into a series of signals that are conveyed to the data processor 14.
  • the data processor 14 processes the received signals in order to detect the character entered on the input device 12.
  • An output device 16 is also connected to the data processor 14 and receives an international ASCII output code representing the handwritten character that was received by the input device 12.
  • the apparatus 10 is operable in a number of modes, each mode of which allows handwritten characters of a different language to be recognized and reproduced.
  • Selection means 18 are provided to allow a user to select the language in which the apparatus 10 is to operate.
  • the processing means 14 is responsive to the selection means 18 and is partitioned into sections 14a, 14b, ..., 14n so that appropriate information for each language is separately stored and accessible depending on the mode selected by the selection means 18.
  • an ideographic character IC is shown. As can be seen, the ideographic character IC is formed from a number of fundamental strokes or primitives, the primitives being labelled as Pr_1 to Pr_3 respectively.
  • the primitives Pr_1 to Pr_3 are fundamental strokes used when writing in the ideographic language.
  • the writing order of the sequence of strokes for ideographic characters is mainly based on logic efficiency, experience and natural human habits. According to several research findings, there exist a
  • Each Chinese character may employ one or more of the above rules in the formation of the character.
  • Examples of basic stroke sequences of ideographic characters are illustrated in Table 1 hereinbelow:
  • the fifteen primitives Pr_a to Pr_o are members of the set of fundamental strokes typically used in the formation of ideographic characters. This sub-set of primitives is chosen since all of the ideographic characters in the various languages can be formed from various combinations of the primitives Pr_a to Pr_o.
  • the primitives Pr_p to Pr_t are used with some of the primitives Pr_a to Pr_o when the apparatus is operating to detect characters written in another language, as will be described.
  • the input device 12 comprises an on-line digitizer tablet 20 having a stylus 20a.
  • the ideographic character to be recognized is written on the tablet 20 with the stylus 20a.
  • This causes a series of cartesian co-ordinate data point signals PN_0 to PN_N to be generated for each of the primitives Pr_a to Pr_o entered that form the ideographic character IC.
  • the upper case "N" of the data point signal refers to the order in which the primitive was entered when forming the character IC, while the subscript "N" refers to the number of the sampled point along the primitive.
  • the data point signals are then conveyed to the data processor 14.
  • a memory 22 is located in the data processor 14 and is connected to the digitizer tablet 20.
  • the memory 22 receives the raw cartesian co-ordinate data point signals and stores them prior to processing.
  • a pre-processor 24 receives a copy of the cartesian co-ordinate data point signals PN_0 to PN_N for each entered primitive and processes the data to remove redundant and spurious data.
  • the pre-processed cartesian co-ordinate data signals are conveyed from the pre-processor 24 to a feature extraction section 26 which converts the cartesian co-ordinate data point signals for each of the entered primitives Pr into a vector code and a series of scalars.
  • the vector code and series of scalars generated by the feature extraction section 26 are applied to a primitive detection section 28 which compares the vector code generated for each entered primitive Pr_a to Pr_o forming the character IC with the contents of a look-up table or dictionary. This allows the processor 14 to detect whether the entered primitives are members of the fifteen primitives Pr_a to Pr_o.
  • a primitive code a to o is generated and conveyed to a memory 30. This process is performed for each vector code representing each primitive Pr forming the entered ideographic character IC.
  • a series of primitive codes or a character code is generated for the entered character which represents the ideographic character IC.
  • the detection section 28 performs tests on the series of scalars associated with the generated vector code to detect the correct entered primitive.
  • the generated character code is conveyed from the memory 30 to a character detection section 32 and compared with the contents of a second look-up table or dictionary.
  • Section 32 stores the character code representing each of the ideographic characters in the language.
  • the stored character codes are based on the requirement that the ideographic characters are formed from a combination of the fifteen primitives illustrated in Figure 3 and that the characters are entered on the tablet 20 in an order as determined by the previously mentioned rules. Since the previously mentioned rules are generally used when writing in an ideographic language, character codes which can represent
  • When the character code generated for the entered ideographic character IC is equivalent to a character code found in the character detection section 32, an associated output code or international ASCII output code is outputted to a memory 34. However, if the character code is equivalent to a character code representing more than one ideographic character, the character detection section 32 performs operations on the raw cartesian co-ordinate data point signals stored in the memory 22 to determine the correct ideographic character IC which the character code represents.
  • a substitution and correction means 36 is also provided and examines the entered character code when it is not equivalent to a character code stored in the character detection section 32.
  • the substitution means 36 substitutes for the entered character code the most probable character code that the entered character code was supposed to represent and conveys it back to the character detection section 32, wherein the above-mentioned process is performed.
  • the international ASCII code representing the ideographic character IC stored in the memory 34 is applied to the output device or devices 16, which typically include a video display terminal (VDT) 16a, a printer 16b and/or an audio synthesizer 16c, wherein an audio and/or visual reproduction of the ideographic character IC can be formed.
  • Referring to Figure 6, the processing means 14 is better illustrated.
  • the pre-processor 24 comprises a comparator 24a and a memory 24b which function in a manner to be described to eliminate redundant and spurious cartesian co-ordinate data point signals.
  • the feature extraction section 26 includes a second comparator 26a and a look-up table or dictionary 26b which function to generate vectors for adjacent cartesian co-ordinate data point signals forming each primitive Pr.
  • a memory 26c receives the vectors and in turn conveys the vectors to a third comparator 26d.
  • the comparator 26d examines the vectors and removes redundant information to form a series of unit vectors or a vector code for each primitive Pr and a series of scalars.
  • the scalars represent the length of each unit vector in the vector code generated for each primitive.
  • the vector code and series of scalars generated for each primitive Pr are conveyed to a memory 26e and stored prior to being conveyed to the primitive detection section 28.
  • the primitive detection section 28 includes a fourth comparator 28a connected to a second look-up table or dictionary 28b.
  • the table 28b stores a list of
  • the primitive detection section 28 also comprises a memory 28c which holds the scalars generated for each vector
  • a test section 28d performs operations on the series of scalars if the vector code associated therewith is equivalent to a vector code which represents more than one of the fifteen primitives. This allows the correct primitive
  • the primitive code a to o associated therewith is applied to the memory 30.
  • the series of primitive codes or character code generated for the entered ideographic character IC is conveyed to the character detection section 32 which comprises a fifth comparator 32a and a third look-up table or dictionary 32b.
  • the dictionary 32b stores a list of the character codes forming each of the ideographic characters in the language and an associated international output code.
  • the comparator 32a and the dictionary 32b function to detect whether the character code representing the entered ideographic character IC is equivalent to a character code representing one or more of the ideographic characters.
  • the character detection section 32 also includes a differentiator 32c which performs tests on the raw cartesian co-ordinate data point signals if the character code is equivalent to a character code which represents more than one ideographic character. This allows the correct ideographic character to be detected.
  • the international ASCII code associated therewith is conveyed to the memory 34 and in turn to the output device 16.
  • the substitution section 36 includes a probability matrix 36a, a sixth comparator 36b and a memory 36c which collectively function to determine the most probable character code that the character code generated for the entered ideographic character IC was supposed to be. This increases the probability of
  • When an ideographic character IC is to be entered into the apparatus 10 via the digitizer tablet 20, the stylus 20a is placed on the tablet 20 and each of the primitives Pr forming the ideographic character IC is drawn separately.
  • each of the primitives used to form the ideographic character IC must be substantially equivalent to one of the fifteen primitives Pr_a to Pr_o.
  • this limitation does not pose many problems since each of the fifteen primitives is a fundamental stroke used by substantially everyone who is capable of writing in an ideographic language.
  • the primitives Pr_a to Pr_o are chosen to reduce the number of entered characters that generate the same character code when inputted into the apparatus 10 and to simplify processing in section 14.
  • the stylus 20a is removed from the tablet 20 for a predetermined length of time. This results in a time-out signal being generated which allows the data processor 14 to recognize that the primitive Pr has been completely entered. Thereafter, the next primitive forming the character is entered and a time-out signal is generated. This process continues until each primitive forming the character has been entered into the apparatus 10.
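  • The time-out mechanism can be sketched as follows. This is a minimal illustration only: the sample format (time, x, y), the function name `split_into_primitives` and the 0.5 second default are assumptions, since the description only speaks of "a predetermined length of time".

```python
from typing import List, Tuple

Sample = Tuple[float, float, float]  # (time in seconds, x, y), pen-down samples only
Point = Tuple[float, float]


def split_into_primitives(samples: List[Sample], timeout: float = 0.5) -> List[List[Point]]:
    """Group pen-down samples into primitives.

    A new primitive starts whenever the gap between two consecutive
    pen-down samples exceeds `timeout` (hypothetical value), which stands
    in for the time-out signal generated when the stylus is lifted.
    """
    primitives: List[List[Point]] = []
    current: List[Point] = []
    last_time = None
    for t, x, y in samples:
        if last_time is not None and t - last_time > timeout and current:
            primitives.append(current)
            current = []
        current.append((x, y))
        last_time = t
    if current:
        primitives.append(current)
    return primitives
```

  • With samples taken roughly every 10 ms while the stylus is down, any pause longer than the time-out closes the current primitive and the next sample opens a new one.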
  • a series of cartesian co-ordinate data point signals is generated.
  • the data processor 14 samples the cartesian co-ordinate data point signals generated for each primitive at a sampling rate of approximately 100 samples per second and stores the sampled co-ordinate data signals in the memory 22.
  • the sampled data for each primitive is continuously
  • the second sampled data point signal is deleted and the distance between the first and the third sampled cartesian co-ordinate data point signals is examined. This process continues until the distance between two data point signals is greater than the threshold value.
  • the first data point signal is conveyed to the memory 24b and the other data point signal is compared with the next preceding data point signal.
  • the second cartesian co-ordinate data point signal is compared with the third data point signal. If the distance between the second and third data point signals is larger than the second threshold value, the second data point signal is assumed to have been generated due to an inadvertent miscoupling of the stylus 20a and the tablet 20 and is deleted. However, if the distance between the second data point signal and the third data point signal is less than the second threshold value, the first data point signal is assumed to have been generated inadvertently and is deleted. This process is performed on the sampled cartesian co-ordinate data point signals for each of the entered primitives forming the entered character and hence, reduces the amount of data that requires processing.
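  • The data-reduction part of the pre-processing can be sketched as below. This is a simplified illustration under stated assumptions: it covers only the removal of points lying outside the tablet (mentioned later in the description) and of points closer than a threshold to the last retained point; the second-threshold test for stylus miscoupling is not modelled. The function name, the threshold value and the tablet dimensions are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def reduce_points(points: List[Point], min_dist: float,
                  width: float, height: float) -> List[Point]:
    """Drop off-tablet points, then drop points too close to the last kept one.

    A point is retained only when its distance from the previously
    retained point is greater than `min_dist` (hypothetical threshold).
    """
    inside = [(x, y) for x, y in points if 0 <= x <= width and 0 <= y <= height]
    kept: List[Point] = []
    for p in inside:
        if not kept or math.dist(kept[-1], p) > min_dist:
            kept.append(p)
    return kept
```

  • For example, with a threshold of 1.0 on a 10 by 10 tablet, the points (0, 0), (0.5, 0), (1, 0), (3, 0), (3.2, 0), (99, 0), (6, 0) reduce to (0, 0), (3, 0), (6, 0).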
  • When the ideographic character IC illustrated in Figure 2 is entered into the apparatus 10, the primitives Pr_1 to Pr_3 forming the character IC are entered on the tablet 20 separately.
  • the data processor 14 samples the cartesian co-ordinate data generated by the tablet 20 for the first primitive Pr_1 and stores the sampled cartesian co-ordinate data point signals P1_1 to P1_5 in the memory 22 as shown in Figures 4a to 4c.
  • the processor 14 samples the cartesian co-ordinate data point signals P2_1 to P2_8 and P3_1 to P3_8 generated for the next two primitives Pr_2 and Pr_3 respectively and stores the sampled cartesian co-ordinate data point signals in the memory 22.
  • the cartesian co-ordinate data point signals are conveyed separately to the pre-processor 24 wherein they are stored in the comparator 24a.
  • the sampled cartesian co-ordinate data point signal P1_1 for the first primitive Pr_1 is compared with the outer boundary cartesian co-ordinates of the digitizer tablet 20. If the sampled data point signal is detected as being outside the boundary of the tablet 20, it is deleted.
  • each of the remaining data point signals P1_2 to P1_5 is compared with the previous data point signal P1_1. For example, if the distance between the data points P1_2 and P1_1 is less than a predetermined value, the data point signal P1_2 is deleted and the data point signal P1_3 is compared with the data point signal P1_1.
  • the data point signal P1_1 is stored in the memory 24b and the above-mentioned process is recommenced, examining the data point signals P1_3 and P1_4.
  • This process is performed for each data point signal sampled for the first primitive Pr_1 until the co-ordinate data representing the inputted primitive Pr_1 has been reduced.
  • This process is also performed on the sampled cartesian co-ordinate data point signals for each of the other entered primitives Pr_2 and Pr_3 and thus, the memory 24b stores the reduced cartesian co-ordinate data point signals for each of the entered
  • entered primitive are converted into a vector code and series of scalars in order to simplify the process of detecting the primitives that were entered on the tablet 20.
  • co-ordinate data is examined to detect whether it has been reduced to a single pair of co-ordinates by the pre-processor 24. This occurs if the primitive Pr_e is entered on the tablet 20. If this primitive is detected, the primitive code e is outputted to the
  • the feature extraction section 26 implements the use of a modified Freeman coding system FC, which is illustrated in Figure 7, when forming the vector codes and scalars to determine the other primitives.
  • the Freeman coding system allows a series of cartesian co-ordinate data point signals (P_0, P_1, ..., P_i, P_{i+1}), where P_0 is equal to (X_0, Y_0) and P_i is equal to (X_i, Y_i), to be converted into a series of unit vectors, each vector of which has an associated length.
  • the unit vectors are formed by comparing a line drawn between adjacent cartesian co-ordinate data point signals P_i and P_{i+1} with one of the eight Freeman unit vectors FV_1 to FV_8 in the Freeman code FC.
  • the Freeman coding system FC uses a 20° tolerance for each of the Freeman unit vectors FV_N and thus allows any line formed between a pair of cartesian co-ordinate data point signals P_i and P_{i+1} falling within one of the boundaries A_1 to A_8 to be assigned to the proper Freeman unit vector FV_N associated with that boundary.
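  • The assignment of a line to a Freeman unit vector can be sketched as follows. Figure 7 is not reproduced here, so several details are assumptions: FV_1 is taken to point along the positive x axis, the vectors are numbered counter-clockwise in 45° steps, and the 20° tolerance is read as plus or minus 20° about each direction, leaving narrow invalid zones in between. The function name `freeman_vector` is hypothetical.

```python
import math
from typing import Optional, Tuple

Point = Tuple[float, float]


def freeman_vector(p: Point, q: Point, tolerance_deg: float = 20.0) -> Optional[int]:
    """Return the index N (1..8) of the Freeman unit vector for the line p -> q.

    Returns None when the line falls in an invalid zone between two
    tolerance boundaries, or when p and q coincide.
    Assumed layout: FV_1 = 0 degrees, FV_2 = 45 degrees, ..., counter-clockwise.
    """
    dx, dy = q[0] - p[0], q[1] - p[1]
    if dx == 0 and dy == 0:
        return None
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    index = round(angle / 45.0) % 8          # nearest of the eight directions
    deviation = abs(angle - index * 45.0)
    deviation = min(deviation, 360.0 - deviation)  # handle wrap-around at 0/360
    if deviation <= tolerance_deg:
        return index + 1
    return None
```

  • In the apparatus described, a line that lands in an invalid zone is not simply discarded: the second point is replaced by another sampled point and the comparison is repeated. A caller of this sketch would do the same when it receives None.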
  • the pre-processed cartesian co-ordinate data point signals are conveyed to the comparator 26a.
  • In the comparator 26a, adjacent cartesian co-ordinate data point signals are examined and a line is formed therebetween.
  • To filter sampled cartesian co-ordinate data generated due to inadvertent movement of the stylus 20a by the operator, the length of the line formed between each adjacent data point signal is compared with a threshold value. If the length is less than that of the predetermined threshold length, the second data point signal is assumed to be the result of a spurious hand movement by the operator and is thus deleted. This process ensures that a horizontal line drawn by an operator with a slight undesired non-horizontal portion will be filtered to produce data representing the desired horizontal line.
  • The line is then compared with the Freeman code FC. If the line falls within one of the tolerance boundaries A_1 to A_8, the Freeman unit vector FV_1 to FV_8 associated therewith is conveyed to the memory 26c. If the line formed between two cartesian co-ordinate data point signals falls within one of the invalid boundaries X_1 to X_8 in the Freeman code FC, the second cartesian co-ordinate data point signal is replaced by the next preceding cartesian co-ordinate data point signal and a new line is formed therebetween. Similarly, the new line is compared with the Freeman code FC once again to detect if the line lies within one of the valid boundaries A_1 to A_8.
  • If so, the Freeman unit vector FV_N associated with the boundary A_N is conveyed to the memory 26c. However, if a valid Freeman unit vector is not detected, the second data point signal of the pair is replaced by the next preceding data point and the same process is repeated. If a line falling in a valid boundary is still not detected after substituting each of the remaining cartesian co-ordinate data points generated for the entered primitive, the cartesian co-ordinates are represented by an invalid Freeman unit vector U' and the invalid Freeman vector is conveyed to the memory 26c.
  • A series of unit vectors FV_N or U' is formed for each of the entered primitives, and the series are stored separately in the memory 26c.
  • the series of unit vectors are then separately conveyed to the comparator 26d.
  • the comparator 26d compares each unit vector FV_{i+1} with the previous unit vector FV_i and if they are equivalent, a scalar count is incremented for that unit vector and the unit vector FV_{i+1} is deleted.
  • This process is performed on the unit vectors generated for each of the entered primitives Pr. This operation results in the formation of a reduced series of unit vectors or a vector code for each entered primitive forming the character, each vector code of which has an associated series of scalars, which represent the length of each of the unit vectors in the vector code.
  • the comparator 26a firstly examines the cartesian co-ordinate data points associated with the first primitive Pr_1 and forms the lines L1_1 to L1_4 between each adjacent data point P1_1 to P1_5 respectively.
  • the lines L1_1 to L1_4 are then compared with the Freeman code FC and the associated Freeman vectors FV_1 to FV_N are assigned to the lines.
  • the primitive Pr_1 formed from cartesian co-ordinate data points P1_1 to P1_5 and generating lines L1_1 to L1_4 as illustrated in Figure 4 is assigned the Freeman vectors FV_3, FV_3, FV_3, FV_3 since each of the lines L1_1 to L1_4 falls within the boundary A_3 (assuming that the length of each of the lines is above the threshold value).
  • When the primitive illustrated in Figure 4a is processed to form the series of Freeman vectors FV_3, FV_3, FV_3, FV_3, the comparator 26d reduces the series of vectors to the vector code FV_3 having a scalar of 4. If, for example, a primitive was entered and a series of Freeman vectors equal to FV_3, FV_3, FV_3, FV_4, FV_4, FV_4, FV_4, FV_5, FV_5, FV_3 was generated therefor, the series of unit vectors would be reduced to the vector code FV_3, FV_4, FV_5, FV_3, and a series of scalars equal to 3, 4, 2, 1 would be generated.
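  • The reduction performed by the comparator 26d amounts to run-length encoding of the unit-vector series. A minimal sketch, assuming the Freeman vectors are represented by their integer indices; the function name `to_vector_code` is hypothetical.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def to_vector_code(unit_vectors: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Collapse consecutive equal unit vectors into (vector code, scalars).

    Each scalar is the number of consecutive occurrences of the
    corresponding vector, i.e. the length of that run.
    """
    code: List[int] = []
    scalars: List[int] = []
    for vector, run in groupby(unit_vectors):
        code.append(vector)
        scalars.append(len(list(run)))
    return code, scalars
```

  • For the first primitive discussed above, `to_vector_code([3, 3, 3, 3])` yields the vector code `[3]` with the scalar series `[4]`.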
  • the vector code and associated series of scalars for each primitive forming the entered character are conveyed to the primitive detection section 28.
  • the vector codes are applied to the comparator 28a and the series of scalars are stored in the memory 28c.
  • the vector codes received by the comparator 28a are compared with the vector codes stored in the primitive dictionary 28b.
  • the dictionary 28b is partitioned into sixteen primitive code sections, the first fifteen sections of which are uniquely associated with one of the fifteen primitives Pr_a to Pr_o and store vector codes uniquely associated with that primitive.
  • the sixteenth section holds ambiguous vector codes which can represent more than one of the primitives.
  • the sixteenth section also holds unique test information for each ambiguous vector code to allow the correct entered primitives to be determined.
  • If the vector code for an entered primitive is equivalent to a vector code found in one of the first fifteen sections of the dictionary 28b, the primitive code a to o associated therewith is conveyed to the memory 30. This process is performed for each of the vector codes generated for each primitive forming the entered character. Thus, a series of primitive codes or a character code is generated, the character code of which represents the ideographic character entered on the digitizer tablet 20.
  • the test information associated with the ambiguous vector code is applied to the test section 28d.
  • the test section 28d receives the test information and examines it to determine which vector code is being examined. Thereafter, the test section 28d receives the series of scalars associated with the examined vector code and performs operations thereon, the operations of which are determined by the unique test information. The results of the tests are conveyed back to the dictionary 28b which in turn selects the correct primitive code that represents the entered primitive.
  • the series of scalars provides suitable information to discriminate between the primitives sharing an ambiguous vector code since, although the vector codes are ambiguous, the values of the scalars in the series are typically very different.
  • If the vector code being compared with the contents of the dictionary 28b is not equivalent to a vector code located therein, the vector code is assigned an unidentified primitive code U, which is similarly applied to the memory 30.
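  • The three outcomes of primitive detection (unique match, ambiguous match resolved by a test on the scalars, no match) can be sketched as below. The dictionary contents are invented for illustration: the actual vector codes stored for each primitive and the actual tests applied to the scalars are not given in this description, and the names `UNIQUE`, `AMBIGUOUS`, `detect_primitive` and `character_code` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

VectorCode = Tuple[int, ...]

# Hypothetical stand-in for the first fifteen sections of the dictionary:
# vector codes that identify exactly one primitive.
UNIQUE: Dict[VectorCode, str] = {
    (1,): "a",
    (3,): "b",
    (3, 1): "c",
}

# Hypothetical stand-in for the sixteenth section: ambiguous vector codes,
# each paired with a test on the scalars that selects the primitive code.
AMBIGUOUS: Dict[VectorCode, Callable[[Sequence[int]], str]] = {
    (2,): lambda scalars: "d" if scalars[0] <= 3 else "f",
}

UNIDENTIFIED = "U"


def detect_primitive(vector_code: Sequence[int], scalars: Sequence[int]) -> str:
    """Map one vector code (plus its scalars) to a primitive code."""
    key = tuple(vector_code)
    if key in UNIQUE:
        return UNIQUE[key]
    if key in AMBIGUOUS:
        return AMBIGUOUS[key](scalars)
    return UNIDENTIFIED


def character_code(strokes: List[Tuple[Sequence[int], Sequence[int]]]) -> str:
    """Concatenate the primitive codes of all strokes into a character code."""
    return "".join(detect_primitive(code, scalars) for code, scalars in strokes)
```

  • With this invented dictionary, three strokes with vector codes FV_1, FV_3, FV_1 produce the character code "aba", and any stroke whose vector code is not listed contributes the code U.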
  • the output of the primitive detection section 28 comprises a series of primitive codes or a character code, which represents the inputted ideographic character IC.
  • the character code stored in the memory 30 is applied to the character code recognition section 32 and received by the comparator 32a.
  • the comparator 32a compares the character code with the contents of the character dictionary 32b generated for the entered character.
  • the dictionary 32b stores a character code for each of the possible ideographic characters in the language along with its
  • SUBSTITUTE SHEET corresponding international ASCII output code The international ASCII output code is used internationally to represent the ideographic character. Since a number of ideographic characters are formed from the same primitives entered in the same order, some ideographic characters have identical character codes although the relative positions between the entered primitives are very different. To allow the apparatus 10 to detect the proper ideographic character when an ambiguous character code is received, the character dictionary 32b also contains test information uniquely associated with each ambiguous character code.
  • a character code When a character code is received from the memory 30, it is compared with the contents of the dictionary 32b via comparator 32a. If received character code is equivalent to a character code found in the dictionary 32b that is uniquely associated with only one ideographic character, the international ASCII output code associated therewith is outputted from the dictionary 32b and stored in the memory 34. However, when the character code generated for the entered ideographic character is equivalent to an ambiguous character code that is associated with more than one ideographic character, the unique test information associated therewith is applied to the character differentiator 32c.
  • Upon reception of the test information, the differentiator 32c retrieves the unprocessed cartesian co-ordinate data from the memory 22 and performs operations thereon as determined by the test information in order to determine the international ASCII output code that represents the inputted ideographic character. When performing the test operations, the unprocessed cartesian co-ordinate data points are used as opposed to
  • a character code equal to "aba" would be generated and compared with the contents of the dictionary 32b.
  • the character code would be detected as being ambiguous since the ideographic characters IC2 and IC3 shown in Figures 9a and 9b respectively are also represented by the same character code "aba".
  • the unique test information associated with the character code "aba" would be applied to the differentiator 32c, along with the unprocessed cartesian co-ordinate data from the memory 22. For this example, the test information would cause the differentiator 32c to examine the position of the second primitive Pr 2 with respect to the first primitive Pr 1 to determine if the second primitive Pr 2 passes through the first primitive Pr 1.
  • the differentiator 32c would acknowledge that the entered ideographic character IC is not equivalent to ideographic character IC2 since this feature is not present in the character IC2.
  • the third primitive Pr 3 is compared with the first primitive Pr 1 forming the entered ideographic character IC and the relative sizes therebetween are examined. The result of this test enables the differentiator 32c to select the correct international ASCII output code since the primitive Pr 1 is smaller than the primitive Pr 3.
  • the dictionary 32b receives the results generated by the differentiator 32c and the correct international ASCII output code is conveyed to the memory 34.
  • After the international ASCII output code has been determined and stored in the memory 34, it can be applied to output devices such as a printer 16a, a VDT terminal 16b or an audio synthesizer 16c in order to produce an image of the inputted ideographic character.
  • the substitution and correction section 36 includes the probability matrix 36a, which is in the form of a sixteen row by fifteen column array of registers 36a'. As shown in Figure 10, each row of the matrix is associated with one of the possible sixteen primitive codes a to o including the unidentified primitive code U and each of the columns of the matrix is associated with one of the fifteen possible primitive codes a to o.
  • Each of the registers 36a' holds a number representing the probability that the primitive code of the row could be mistaken for the primitive code of the column.
  • the probability values stored in the registers along the left to right diagonal of the matrix 36a all have values of 1 since the probability that a primitive code will be detected as itself is high.
  • two very dissimilar primitives are highly unlikely to be mistaken for one another and thus, the probability value stored in a register associated with two dissimilar primitives is typically zero.
  • the probabilities in the row associated with the primitive code U are examined.
  • the primitive code of the column is used to replace the unidentified primitive code U.
  • the resultant character code is conveyed back to the comparator 32a and is compared with the contents of the character dictionary 32b to detect if the resultant character code is equivalent to a character code found therein. If the resultant character code is equivalent to a character code in the dictionary, the international ASCII output code is retrieved from the dictionary 32b and conveyed to the memory 34 wherein it is stored. If the resultant input character code is equivalent to an ambiguous character code, tests are performed on the cartesian co-ordinate data stored in the memory 22 in the same manner as previously described to determine the correct international ASCII output code.
  • the character code is conveyed to the comparator 36b and examined to identify the number of primitive codes forming the character code.
  • each character code in the character dictionary 32b formed from the same number of primitive codes is conveyed to the comparator 36b and compared with the unidentified character code. During this comparison, the number of differences between the primitive codes forming each of the character codes and the primitive codes forming the unidentified character code is examined. If the number of differences detected between the character code and the unidentified character code is greater than a threshold value, the character code is discarded.
  • the international ASCII output code associated therewith is stored in the memory 36c.
  • the order of the international output codes stored in the memory 36c is chosen so that the first international ASCII output code in the memory is associated with the character code most similar to the unidentified character code.
  • the international output codes stored in the memory 36c are then retrieved from the memory 36c and conveyed to the VDT terminal, thereby displaying to the user each of the ideographic characters that are most likely to be equivalent to the entered ideographic character.
  • the user may then choose the ideographic character corresponding to the ideographic character that was entered into the apparatus 10 via suitable editing software. If the substitution section 36 does not produce the desired ideographic character, editing programs can be used to retrieve the correct international ASCII output code from the dictionary 32b.
  • the ideographic character signals stored in the memory 34 can be coupled to the printer 16a to allow a reproduction of the inputted ideographic character to be generated. Furthermore, the character signals can be conveyed to the VDT screen 16b to allow the user to view the characters that have been entered into the apparatus 10.
  • the apparatus 10 is also capable of functioning with known editing programs to allow the user to change the ideographic character signals stored in the memory 34.
  • the same set of primitives shown in Figure 3 is used to form the characters. It should be apparent that the primitives shown in Figure 3 are particularly useful in forming ideographic and upper case English language characters since all of the characters in these languages can be formed from these primitives. It should be appreciated that other primitives may have to be added so that all of the characters in all languages can be formed; however, this will be rare since the twenty primitives should be capable of forming substantially all of the characters in every language.
  • the dictionaries in the processor 14 are partitioned with each partition holding the various primitive codes, character codes and ASCII output codes for each upper case character in the other languages.
  • the upper case characters are stored in the apparatus since these characters are typically written in the same manner and order by everyone versed in the language.
  • the various sections in the processor also include test information to allow different characters which generate the same character code to be recognized.
  • the primitive detection and primitive code determination is performed in the same manner previously described using the Freeman coding except when one of the primitives Pr p to Pr t is entered on the tablet 20. Accordingly, when a primitive is entered on the tablet 20, the feature extraction section 26 examines the tangents of the lines formed between the sampled points along the primitive to determine the degree of curvature of the primitive (i.e. 180°, 270°, 360°) prior to using the Freeman coding.
  • the primitive code s or t associated with the entered primitive Pr s or Pr t is immediately determined without further processing. If the curvature of the primitive is detected as being approximately 180°, the starting and ending co-ordinate data signals of the primitive are examined along with the direction of the tangents (i.e. clockwise or counter-clockwise). This allows the primitives Pr p to Pr r to be differentiated without requiring further processing. Otherwise, if the entered primitive is not detected as having a substantially constant gradient when examining the tangents, the pre-processed co-ordinate data signals are processed using the Freeman coding to determine the correct primitive code.
  • the method of detecting the handwritten characters is the same although the apparatus must be conditioned to the appropriate mode via means 18. This is necessary even for languages like German, French and English wherein the characters forming the language are the same, since the ASCII output codes therefor are different.
  • the substitution matrix can also be used for each of the other languages although it is not necessary due to the small number of characters used in non-ideographic languages.
  • when the apparatus 10 is conditioned to detect upper case characters of a language, the device also includes software for outputting the ASCII code for the lower case equivalent of the detected upper case character if desired.
  • although the lower case letters can be detected in a similar manner to the upper case letters, lower case letters are typically written differently by individuals, thereby making the detection process more difficult and requiring more memory space to permit detection of the character in the many possible ways that it can be written.
  • the present apparatus has been employed in an IBM PC XT personal computer manufactured by International Business Machines provided with a 20 Mb hard disk which functions to store the information for the dictionaries.
  • the computer is supplied with the appropriate software which allows the input cartesian co-ordinate data point signals to be processed in the above-mentioned manner.
  • a B-tree algorithm, which is well known in the art, is used to increase the speed of the comparison between the character code generated for the inputted ideographic character and the character codes stored therein.
  • although the B-tree algorithm increases processing speed, it also increases memory requirements, since indexing files are required.
  • the present apparatus 10 can also be manufactured on a small integrated circuit board capable of being coupled to a conventional personal computer, the board being provided with ROM components to store
  • the present apparatus provides the advantage of being able to distinguish between characters which are formed from the same primitives entered in the same order. This decreases the occurrences of an operator having to halt data entry operations in order to choose the correct ideographic character. Moreover, the substitution means further decreases the above-mentioned occurrence by allowing the present apparatus to choose a different character code that is most similar to the entered character code, if the input character is not found in the apparatus 10. Furthermore, since the apparatus can be implemented using software or manufactured using hardware components, the apparatus is versatile and can be used in various environments.
  • the present device also provides further advantages in that the manner in which the entered strokes are processed in the apparatus allows the strokes to be written substantially anywhere on the tablet surface, except for the small number of characters which generate an ambiguous character code. Also, the processing used prior to the determination of the primitives forming the character allows the entered characters to be determined irrespective of the length of the entered primitives, with a few exceptions.
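The candidate listing performed by the comparator 36b and the memory 36c, as described in the list above, can be sketched as follows. This is a minimal illustration rather than the patented implementation: the dictionary contents, the output code names and the function name `rank_candidates` are hypothetical, and the number of differences is assumed to be counted position by position.

```python
def rank_candidates(unidentified_code, character_dictionary, threshold=2):
    """Return output codes for same-length character codes, most similar first.

    Character codes whose number of differing primitive codes exceeds the
    threshold are discarded, as described for the comparator 36b.
    """
    scored = []
    for code, output_code in character_dictionary.items():
        if len(code) != len(unidentified_code):
            continue  # only codes formed from the same number of primitives
        differences = sum(1 for a, b in zip(code, unidentified_code) if a != b)
        if differences <= threshold:
            scored.append((differences, code, output_code))
    scored.sort()  # fewest differences first, mirroring the order in memory 36c
    return [output_code for _, _, output_code in scored]


# Hypothetical dictionary: character code -> output code placeholder.
CHARACTER_DICTIONARY = {"aba": "OUT_1", "abc": "OUT_2", "cbd": "OUT_3", "ab": "OUT_4"}

print(rank_candidates("abU", CHARACTER_DICTIONARY))
```

The returned list plays the role of the ordered output codes that would be shown to the user on the VDT terminal for selection.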

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Character Discrimination (AREA)

Abstract

In an apparatus and method for identifying characters, each of the characters is a member of a group and is formed from a number of predetermined primitives. The apparatus includes an input device that successively receives each primitive forming a character. The input device produces input signals for each primitive forming the character. The input signals are conveyed to a processor, which examines them and attempts to identify each of the primitives used to form the character. A primitive code is generated for each identified primitive and an unidentified primitive code is generated for each unidentified primitive. The primitive codes and unidentified primitive codes are combined to form an input character code. A memory stores a character code and an international output code for each of the characters in the group. A comparator compares the input character code generated for the entered character with each of the character codes stored in the memory. When the input character code is equivalent to a character code associated with a single output code, the output code is conveyed to an output device, such as a printer, in which a reproduction of the entered character is made. When the character code is equivalent to a character code associated with more than one output code, a differentiator detects the correct output code associated with the input character code, to allow reproduction of the entered character.

Description

CHARACTER RECOGNITION APPARATUS
The present invention relates to an apparatus and method for identifying characters.
Since trade between non-English speaking countries and Western countries has increased dramatically, the importance of communications has increased. For example, in the past when corresponding between English and Chinese speaking countries, a document written in English that was received in China would firstly be forwarded to a government translation centre. The document would then be translated and transcribed by hand into Chinese and finally delivered to the addressee of the document. When a response to the translated document was prepared, the response would be translated from Chinese into English at the government translation centre and forwarded to the English correspondent. However, a problem existed in that the use of translators to transcribe the documents from English to Chinese and vice versa added a significant delay in the communications process.
To overcome these difficulties, a typewriter device has been developed having keys representing the ideographic characters of the Chinese language. This device allows hard copies of documents written in Chinese to be produced by hiring an operator skilled in the Chinese language and capable of using the typewriter. However, a problem exists in that a large number of keys are required on the typewriter device since the Chinese language includes more than 50,000 different ideographic characters. Improvements to this type of device have been introduced to reduce the number of keys required by using function keys; however, the above-mentioned problem still exists. Furthermore, another problem exists when using the typewriter devices in that extensive training is required for the operators to learn how to use the keyboard device adequately, a process which is expensive and time consuming.
To overcome the problems encountered when using the keyboard devices, an ideographic character detection apparatus has been developed for receiving and identifying handwritten ideographic characters. The apparatus requires that the ideographic character be written on an input device and that the written characters be formed from predetermined fundamental strokes or primitives which are typical strokes used by everyone who writes in the ideographic language. After an ideographic character has been entered into the apparatus, the apparatus examines the primitives forming the entered ideographic character and compares the entered primitives with the contents of a look-up table. The look-up table stores a plurality of variations of each of the predetermined primitives to accommodate variations in users' handwriting. Due to the large number of variations of each primitive stored in the table, the primitives forming the character are usually determined by the device. The table also stores the sets of primitives used to form each of the characters in the ideographic language. If the set of primitives forming the entered character corresponds with one of the sets of primitives in the look-up table, an output code associated with the set of primitives is generated and conveyed to an output device. This allows a hard copy image of the entered ideographic character to be formed. However, a problem exists in that, due to the large number of variations of each primitive stored in the table, the processing speed of the apparatus is greatly reduced, making it unsuitable for real-time applications.
Moreover, the number of predetermined fundamental strokes or primitives used in this apparatus has typically been chosen to be five or less or twenty or more. By using only five fundamental primitives in the sub-set to form every ideographic character in the language, a problem exists in that a large number of different ideographic characters are formed from the identical set of primitives even though the ideographic characters are unique in appearance. This results in the decreased ability of the apparatus to distinguish between different ideographic characters.
To attempt to overcome this problem, twenty or more distinct primitives have been included in the sub-set. However, the same problem still exists in that different ideographic characters are still formed from the identical series of primitives, although the occurrence of a series of primitives representing more than one ideographic character is reduced. Moreover, by increasing the number of primitives in the sub-set, another problem exists in that the processing time of the apparatus is further increased.
Furthermore, still yet another problem exists in that typically these devices are capable of detecting characters written in one language only and do not permit multi-language character detection. Accordingly, there is a need for an improved character recognition apparatus.
It is therefore an object of the present invention to obviate or mitigate the above disadvantages.
According to the present invention there is provided a character recognition apparatus for identifying characters formed from a number of primitives, said characters and primitives being members of predetermined sets, said apparatus comprising: input means for receiving successively each of the primitives forming said character and generating input signals for each of said received primitives; processing means receiving said input signals and identifying each of said primitives received by said input means, said processing means generating a character code representing said character upon identification of said primitives; storage means storing a character code and an associated output code for each of the characters in said set; comparing means comparing said character code generated for said entered character with each of said character codes in said storage means to identify said entered character; and output means in communication with said comparison means and generating a reproduction of said entered character upon the identification thereof by said comparison means.
Preferably, the apparatus further includes differentiation means examining said input signals generated for each of said primitives and performing operations thereon, when said character code is equivalent to a character code associated with a plurality of output codes to identify the output code associated with said character.
Preferably the apparatus is provided with substitution means for selecting the character code stored in the storage means having the highest probability of being equivalent to the character code generated for the entered character, when the input character code is not equivalent to any of the character codes stored in the storage means. It is also preferred that the output means comprises at least one device chosen from the group comprising a printer, audio synthesizer or video display terminal to allow a reproduction of the received ideographic character to be formed or an audio reproduction of the ideographic character to be produced.
Preferably, the character recognition apparatus is capable of recognizing characters written in all ideographic languages, upper case English language characters, and Russian characters.
It is also desirable that the predetermined set of fundamental primitives is chosen to comprise 20 unique primitives, the various combinations of which will form substantially all characters in a plurality of different languages, whilst decreasing the occurrence of different characters being formed from the same series of primitives. Thus, the use of twenty distinct primitives decreases the occurrence of entered characters being represented by character codes which are equivalent to a character code associated with more than one international output code. This, of course, increases the probability of detecting the correct ideographic character.
An embodiment of the present invention will now be described, by way of example only, with reference to the accompanying drawings in which:
Figure 1 is a functional block diagram of an apparatus for identifying characters;
Figure 2 is an illustration of an ideographic character;
Figure 3 is an illustration of the fundamental primitives used in the device illustrated in Figure 1;
Figures 4a to 4c are illustrations of the method of forming the character shown in Figure 2 from the primitives shown in Figure 3;
Figure 5 is a more detailed functional block diagram of the device illustrated in Figure 1;
Figure 6 is a detailed functional block diagram of a portion of the device illustrated in Figure 1;
Figure 7 is an illustration of a coding method used in the device illustrated in Figure 1;
Figures 8a and 8b are illustrations of entered fundamental strokes;
Figures 9a and 9b are illustrations of still more ideographic characters;
Figure 10 is an illustration of a probability matrix used in the device illustrated in Figure 1;
Figure 11 is an illustration of an English character; and
Figure 12 is an illustration of more English characters.
Referring to Figure 1, an apparatus 10 for identifying handwritten characters is shown. The apparatus 10 comprises an input device 12 connected to a data processor 14. The input device 12 receives the handwritten character and converts the character into a series of signals that are conveyed to the data processor 14. The data processor 14 processes the received signals in order to detect the character entered on the input device 12. An output device 16 is also connected to the data processor 14 and receives an international ASCII output code representing the handwritten character that was received by the input device 12. This allows a reproduction of the handwritten character to be generated.
The apparatus 10 is operable in a number of modes, each of which allows handwritten characters of a different language to be recognized and reproduced. Selection means 18 are provided to allow a user to select the language in which the apparatus 10 is to operate. Thus, the processing means 14 is responsive to the selection means 18 and is partitioned into sections 14a, 14b,..., 14n so that appropriate information for each language is separately stored and accessible depending on the mode selected by the selection means 18.
For simplicity, the apparatus shown in Figure 1 will be described when the processing means 14 is conditioned to detect ideographic characters, although it should be realized that characters in other languages can be detected in a similar manner by conditioning the selection means 18 to a different mode.
Referring to Figure 2, an ideographic character IC is shown. As can be seen, the ideographic character IC is formed from a number of fundamental strokes or primitives, the primitives being labelled as Pr1 to Pr3 respectively. The primitives Pr1 to Pr3 are fundamental strokes used when writing in the ideographic language.
The writing order of the sequence of strokes for ideographic characters is mainly based on logic, efficiency, experience and natural human habits. According to several research findings, there exist a number of basic rules when writing ideographic characters and they are as follows:
up - down
left - right
out - in
horizontal - vertical
left slant - right slant
first enter - last close.
Each Chinese character may employ one or more of the above rules in the formation of the character. Examples of basic stroke sequences of ideographic characters are illustrated in Table 1 hereinbelow:
TABLE 1
To decrease the number of primitives that a user is required to write when forming an ideographic character and to reduce the amount of data that has to be processed by the processor 14, fifteen primitives Pra to Pro of the twenty primitives illustrated in Figure 3 are used by the apparatus 10. The fifteen primitives Pra to Pro are members of the set of fundamental strokes typically used in the formation of ideographic characters. This sub-set of primitives is chosen since all of the ideographic characters in the various languages can be formed from various combinations of the primitives Pra to Pro. The primitives Prp to Prt are used with some of the primitives Pra to Pro when the apparatus is operating to detect characters written in another language, as will be described.
Referring now to Figure 5, the apparatus 10 is better illustrated. The input device 12 comprises an on-line digitizer tablet 20 having a stylus 20a. The ideographic character to be recognized is written on the tablet 20 with the stylus 20a. This causes a series of cartesian co-ordinate data point signals PN0 to PNN to be generated for each of the primitives Pra to Pro entered that form the ideographic character IC. The upper case "N" of the data point signal refers to the order in which the primitive was entered when forming the character IC while the subscript "N" refers to the number of the sampled point along the primitive. The data point signals are then conveyed to the data processor 14.
A memory 22 is located in the data processor 14 and is connected to the digitizer tablet 20. The memory 22 receives the raw cartesian co-ordinate data point signals and stores them prior to processing. A pre-processor 24 receives a copy of the cartesian co-ordinate data point signals PN0 to PNN for each entered primitive and processes the data to remove redundant and spurious data. The pre-processed cartesian co-ordinate data signals are conveyed from the pre-processor 24 to a feature extraction section 26 which converts the cartesian co-ordinate data point signals for each of the entered primitives Pr into a vector code and a series of scalars.
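The text does not spell out at this point how the pre-processor 24 decides which data points are redundant or spurious. The sketch below is therefore purely an assumption made for illustration: it treats a repeated sample as redundant and a sample that jumps further than a hypothetical `max_jump` from the previous accepted sample as spurious. The function name and the threshold are not taken from the source.

```python
def preprocess(points, max_jump=50):
    """Drop repeated samples and implausible jumps from one stroke.

    Assumed rules (not from the source): a point identical to the previous
    accepted point is redundant; a point further than max_jump tablet units
    away on either axis is spurious.
    """
    cleaned = []
    for x, y in points:
        if cleaned:
            prev_x, prev_y = cleaned[-1]
            if (x, y) == (prev_x, prev_y):
                continue  # redundant: the stylus did not move
            if abs(x - prev_x) > max_jump or abs(y - prev_y) > max_jump:
                continue  # spurious: implausible jump between samples
        cleaned.append((x, y))
    return cleaned


raw_stroke = [(0, 0), (0, 0), (1, 0), (500, 500), (2, 0)]
print(preprocess(raw_stroke))
```

Keeping the raw points in a separate store, as the memory 22 does, means that later tests can still work on the unprocessed data even after this kind of filtering.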
The vector code and series of scalars generated by the feature extraction section 26 are applied to a primitive detection section 28 which compares the vector code generated for each entered primitive Pra to Pro forming the character IC with the contents of a look-up table or dictionary. This allows the processor 14 to detect whether the entered primitives are members of the fifteen primitives Pra to Pro. When an entered primitive Pr results in the formation of a vector code equivalent to a vector code associated with only one of the fifteen primitives stored in the primitive detection section 28, a primitive code a to o is generated and conveyed to a memory 30. This process is performed for each vector code representing each primitive Pr forming the entered ideographic character IC. Thus, a series of primitive codes or a character code is generated for the entered character which represents the ideographic character IC. However, if a vector code generated for an entered primitive Pr is equivalent to a vector code associated with more than one of the fifteen primitives Pra to Pro, the detection section 28 performs tests on the series of scalars associated with the generated vector code to detect the correct entered primitive.
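The look-up performed by the primitive detection section 28 can be pictured with a small sketch. The dictionary entries, the direction numbers and the scalar test below are hypothetical placeholders chosen only to show the three outcomes the text describes: a unique match, an ambiguous match resolved with the scalars, and no match, which yields the unidentified primitive code U.

```python
# Hypothetical primitive dictionary: vector code -> candidate primitive codes.
PRIMITIVE_DICTIONARY = {
    (0,): ["a"],
    (6,): ["b"],
    (6, 0): ["c", "d"],  # one vector code shared by two primitives
}


def resolve_by_scalars(candidates, scalars):
    """Assumed test: pick the first candidate when the first segment is the longer one."""
    return candidates[0] if scalars[0] >= scalars[-1] else candidates[1]


def detect_primitive(vector_code, scalars):
    """Return a primitive code, or "U" when the vector code is not in the dictionary."""
    candidates = PRIMITIVE_DICTIONARY.get(tuple(vector_code))
    if not candidates:
        return "U"
    if len(candidates) == 1:
        return candidates[0]
    return resolve_by_scalars(candidates, scalars)


# A character code is the series of primitive codes in the order of entry.
strokes = [([0], [5.0]), ([6, 0], [1.0, 4.0]), ([3], [2.0])]
print("".join(detect_primitive(code, scalars) for code, scalars in strokes))
```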
The generated character code is conveyed from the memory 30 to a character detection section 32 and compared with the contents of a second look-up table or dictionary. Section 32 stores the character code representing each of the ideographic characters in the language. The stored character codes are based on the requirement that the ideographic characters are formed from a combination of the fifteen primitives illustrated in Figure 3 and that the characters are entered on the tablet 20 in an order as determined by the previously mentioned rules. Since the previously mentioned rules are generally used when writing in an ideographic language, character codes which can represent ideographic characters, but are formed from primitives entered in an incorrect manner, are omitted from the look-up table.
When the character code generated for the entered ideographic character IC is equivalent to a character code found in the character detection section 32, an associated output code or international ASCII output code is outputted to a memory 34. However, if the character code is equivalent to a character code representing more than one ideographic character, the character detection section 32 performs operations on the raw cartesian co-ordinate data point signals stored in the memory 22 to determine the correct ideographic character IC that the character code represents. This allows the correct international ASCII code to be outputted to the memory 34.
A substitution and correction means 36 is also provided and examines the entered character code when it is not equivalent to a character code stored in the character detection section 32. The substitution means 36 substitutes for the entered character code the most probable character code that the entered character code was supposed to represent and conveys it back to the character detection section 32 wherein the above-mentioned process is performed.
The international ASCII code representing the ideographic character IC stored in the memory 34 is applied to the output device or devices 16 which typically include a video display terminal (VDT) 16a, printer 16b and/or a video synthesizer 16c wherein an audio and/or visual reproduction of the ideographic character IC can be formed.

Referring to Figure 6, the processing means 14 is better illustrated. The pre-processor 24 comprises a comparator 24a and a memory 24b which function in a manner to be described to eliminate redundant and spurious cartesian co-ordinate data point signals. The feature extraction section 26 includes a second comparator 26a and a look-up table or dictionary 26b which function to generate vectors for adjacent cartesian co-ordinate data point signals forming each primitive Pr. A memory 26c receives the vectors and in turn conveys the vectors to a third comparator 26d. The comparator 26d examines the vectors and removes redundant information to form a series of unit vectors or a vector code for each primitive Pr and a series of scalars. The scalars represent the length of each unit vector in the vector code generated for each primitive. The vector code and series of scalars generated for each primitive Pr are conveyed to a memory 26e and stored prior to being conveyed to the primitive detection section 28.
The primitive detection section 28 includes a fourth comparator 28a connected to a second look-up table or dictionary 28b. The table 28b stores a list of predetermined vector codes and a primitive code for each primitive Pra to Pro. The vector codes represent one or more of the fifteen primitives Pra to Pro. The primitive detection section 28 also comprises a memory 28c which holds the scalars generated for each vector code and a test section 28d. The test section 28d performs operations on the series of scalars if the vector code associated therewith is equivalent to a vector code which represents more than one of the fifteen primitives. This allows the correct primitive to be determined. When the vector code for each of the entered primitives Pr is located in the dictionary 28b, the primitive code a to o associated therewith is applied to the memory 30.
The series of primitive codes or character code generated for the entered ideographic character IC is conveyed to the character detection section 32 which comprises a fifth comparator 32a and a third look-up table or dictionary 32b. The dictionary 32b stores a list of the character codes forming each of the ideographic characters in the language and an associated international output code. The comparator 32a and the dictionary 32b function to detect whether the character code representing the entered ideographic character IC is equivalent to a character code representing one or more of the ideographic characters. The character detection section 32 also includes a differentiator 32c which performs tests on the raw cartesian co-ordinate data point signals if the character code is equivalent to a character code which represents more than one ideographic character. This allows the correct ideographic character to be detected. When the correct ideographic character has been identified, the international ASCII code associated therewith is conveyed to the memory 34 and in turn to the output device 16.
As mentioned previously, when the character code is not equivalent to a character code found in the dictionary 32b, the substitution and correction means 36 is used. The substitution section 36 includes a probability matrix 36a, a sixth comparator 36b and a memory 36c which collectively function to determine the most probable character code that the character code generated for the entered ideographic character IC was supposed to be. This increases the probability of
detecting the ideographic character IC entered on the digitizer tablet 20.
When an ideographic character IC is to be entered into the apparatus 10 via the digitizer tablet 20, the stylus 20a is placed on the tablet 20 and each of the primitives Pr forming the ideographic character IC is drawn separately. As described hereinabove, the primitives used to form the ideographic character IC must be substantially equivalent to one of the fifteen primitives Pra to Pro. However, this limitation does not pose many problems since each of the fifteen primitives is a fundamental stroke used by substantially everyone who is capable of writing in an ideographic language. Furthermore, the primitives Pra to Pro are chosen to reduce the number of entered characters that generate the same character code when inputted into the apparatus 10 and to simplify processing in section 14. After a primitive Pr has been entered, the stylus 20a is removed from the tablet 20 for a predetermined length of time. This results in a time-out signal being generated which allows the data processor 14 to recognize that the primitive Pr has been completely entered. Thereafter, the next primitive forming the character is entered and a time-out signal is generated. This process continues until each primitive forming the character has been entered into the apparatus 10.
As the stylus 20a is moved across the tablet 20 to form a primitive Pr, a series of cartesian co-ordinate data point signals is generated. The data processor 14 samples the cartesian co-ordinate data point signals generated for each primitive at a sampling rate of approximately 100 samples per second and stores the sampled co-ordinate data signals in the memory 22. The sampled data for each primitive is continuously stored in separate registers until the data processor 14 receives a time-out signal signifying that the complete primitive has been entered. While the next primitive Pr2 is being formed on the tablet 20, the sampled cartesian co-ordinate data point signals are separately stored in different registers in the memory 22 until the next time-out signal is detected by the processor 14. This process continues until each primitive forming the ideographic character has been entered and the cartesian co-ordinate data signals generated therefor have been stored separately in the memory 22. To indicate to the data processor 14 that the entire ideographic character IC has been entered, an end-of-character (EOC) key located on the tablet must be depressed. This prevents further data entered on the tablet 20 from corrupting the data associated with the previously entered ideographic character.
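The separation of sampled data into primitives by the time-out signal can be sketched as follows. This is an illustrative sketch only: the function name `split_into_primitives`, the representation of a sample as a (time, x, y) tuple and the 0.5 second time-out are assumptions, since the text specifies only "a predetermined length of time".

```python
from typing import List, Tuple

Sample = Tuple[float, float, float]   # (time in seconds, x, y): assumed layout
Point = Tuple[float, float]


def split_into_primitives(samples: List[Sample],
                          timeout: float = 0.5) -> List[List[Point]]:
    """Group pen-down samples into primitives.

    A gap of at least `timeout` seconds between consecutive samples is
    treated as the time-out signal that ends the current primitive.
    The 0.5 s default is a hypothetical value.
    """
    primitives: List[List[Point]] = []
    current: List[Point] = []
    last_time = None
    for t, x, y in samples:
        if last_time is not None and t - last_time >= timeout and current:
            primitives.append(current)   # time-out: close the primitive
            current = []
        current.append((x, y))
        last_time = t
    if current:
        primitives.append(current)       # the last primitive before the EOC key
    return primitives
```

Each inner list then plays the role of the separate registers in the memory 22 that hold the data of one primitive.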
Since a digitizer tablet 20 is used, temporal and irregular noise occurs during the sampling process due to miscoupling of the stylus 20a and the digitizer tablet surface 20. Furthermore, small amplitude noise occurs due to uneven movements in the operator's hand which introduce discrepancies between the sampled cartesian co-ordinate data point signals and the desired cartesian co-ordinate data point signals. Also, the slow movement of the stylus 20a across the digitizer tablet surface 20 with respect to the sampling rate of the processor 14 introduces a large number of redundant data point signals which in turn requires a large amount of storage space and increases the processing time of the apparatus 10. Thus, as mentioned previously, the pre-processor 24 is used to reduce the redundant and spurious data.
To perform this function, a copy of the sampled cartesian co-ordinate data point signals is applied to the comparator 24a. To reduce the noise caused by the inadvertent decoupling of the stylus 20a and the digitizer tablet 20, the sampled cartesian co-ordinate data point signals are separately analyzed. If any sampled cartesian co-ordinate data point signal is detected as having a set of co-ordinates extending beyond the boundary of the digitizer tablet 20, the cartesian co-ordinate data point signal is deleted. Secondly, to reduce the amount of redundant data and hence, to increase the processing speed of the apparatus 10, the first two cartesian co-ordinate data point signals are compared in the comparator 24a. If the distance between the two cartesian co-ordinate data point signals is less than a predetermined threshold value, the second sampled data point signal is deleted and the distance between the first and the third sampled cartesian co-ordinate data point signals is examined. This process continues until the distance between two data point signals is greater than the threshold value. When the distance is greater than the threshold value, the first data point signal is conveyed to the memory 24b and the other data point signal is compared with the next succeeding data point signal.
Furthermore, if the distance between the two cartesian co-ordinate data point signals is greater than a second predetermined threshold value, the second cartesian co-ordinate data point signal is compared with the third data point signal. If the distance between the second and third data point signals is larger than the second threshold value, the second data point signal is assumed to have been generated due to an inadvertent miscoupling of the stylus 20a and the tablet 20 and is deleted. However, if the distance between the second data point signal and the third data point signal is less than the second threshold value, the first data point signal is assumed to have been generated inadvertently and is deleted. This process is performed on the sampled cartesian co-ordinate data point signals for each of the entered primitives forming the entered character and hence, reduces the amount of data that requires processing.
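The three pre-processing operations described above (boundary check, removal of redundant points and removal of spurious points) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the rectangular tablet with its origin at (0, 0) and all threshold values are hypothetical, and the behaviour when a distance is exactly equal to a threshold is not specified in the text.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def drop_out_of_bounds(points: List[Point], width: float, height: float) -> List[Point]:
    """Delete points lying outside an assumed rectangular tablet area."""
    return [(x, y) for x, y in points if 0 <= x <= width and 0 <= y <= height]


def thin(points: List[Point], min_dist: float) -> List[Point]:
    """Delete points closer than `min_dist` to the last point that was kept."""
    if not points:
        return []
    kept = [points[0]]
    for p in points[1:]:
        if math.dist(kept[-1], p) >= min_dist:
            kept.append(p)       # far enough: keep it and compare onward from it
    return kept


def drop_spikes(points: List[Point], max_dist: float) -> List[Point]:
    """Delete points that jump more than `max_dist` (the second threshold)."""
    pts = list(points)
    i = 0
    while i + 2 < len(pts):
        if math.dist(pts[i], pts[i + 1]) > max_dist:
            if math.dist(pts[i + 1], pts[i + 2]) > max_dist:
                del pts[i + 1]   # second point is far from both neighbours
            else:
                del pts[i]       # second and third agree: first point is spurious
        else:
            i += 1
    return pts
```

The order in which the three steps are chained is a design choice; the text lists the boundary check first.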
For example, if the ideographic character IC illustrated in Figure 2 is entered into the apparatus 10, the primitives Pr1 to Pr3 forming the character IC are entered on the tablet 20 separately. The data processor 14 samples the cartesian co-ordinate data generated by the tablet 20 for the first primitive Pr1 and stores the sampled cartesian co-ordinate data point signals P1_1 to P1_5 in the memory 22 as shown in Figures 4a to 4c. Similarly, the processor 14 samples the cartesian co-ordinate data point signals P2_1 to P2_8 and P3_1 to P3_8 generated for the next two primitives Pr2 and Pr3 respectively and stores the sampled cartesian co-ordinate data point signals in the memory 22.
Following this, the cartesian co-ordinate data point signals are conveyed separately to the pre-processor 24 wherein they are stored in the comparator 24a. Firstly, the sampled cartesian co-ordinate data point signal P1_1 for the first primitive Pr1 is compared with the outer boundary cartesian co-ordinates of the digitizer tablet 20. If the sampled data point signal is detected as being outside the boundary of the tablet 20, it is deleted. Secondly, each of the remaining data point signals P1_2 to P1_5 is compared with the previous data point signal P1_1. For example, if the distance between the data points P1_2 and P1_1 is less than a predetermined value, the data point signal P1_2 is deleted and the data point signal P1_3 is compared with the data point signal P1_1. If the distance between the data point signals P1_3 and P1_1 is greater than the threshold value, the data point signal P1_1 is stored in the memory 24b and the above-mentioned process is recommenced examining the data point signals P1_3 and P1_4. This process is performed for each data point signal sampled for the first primitive Pr1 until the co-ordinate data representing the inputted primitive Pr1 has been reduced. This process is also performed on the sampled cartesian co-ordinate data point signals for each of the other entered primitives Pr2 and Pr3 and thus, the memory 24b stores the reduced cartesian co-ordinate data point signals for each of the entered primitives.
When the spurious and redundant sampled cartesian co-ordinate data point signals for each entered primitive have been removed, the resultant data point signals are conveyed from the memory 24b to the feature extraction section 26.
In the feature extraction section 26, the cartesian co-ordinate data point signals for each entered primitive are converted into a vector code and series of scalars in order to simplify the process of detecting the primitives that were entered on the tablet 20. However, prior to forming the vector code and scalars for the entered primitive, the cartesian co-ordinate data is examined to detect whether it has been reduced to a single pair of co-ordinates by the pre-processor 24. This occurs if the primitive Pre is entered on the tablet 20. If this primitive is detected, the primitive code e is outputted to the memory without requiring any further processing.

The feature extraction section 26 implements the use of a modified Freeman coding system FC which is illustrated in Figure 7 when forming the vector codes and scalars to determine the other primitives. The Freeman coding system allows a series of cartesian co-ordinate data point signals (P0, P1, ..., Pi, Pi+1), where P0 is equal to (X0, Y0) and Pi is equal to (Xi, Yi), to be converted into a series of unit vectors, each vector of which has an associated length. The unit vectors are formed by comparing a line drawn between adjacent cartesian co-ordinate data point signals Pi and Pi+1 with one of the eight Freeman unit vectors FV1 to FV8 in the Freeman code FC.
However, due to angles introduced into the shape of the entered primitives on the digitizer tablet 20, a tolerance is required to allow a line formed between a pair of cartesian co-ordinate data point signals Pi and Pi+1 that is not coincident with a Freeman unit vector FVN to be assigned to the correct Freeman unit vector. To accommodate these drawing variations of the entered primitives, the Freeman coding system FC uses a 20° tolerance for each of the Freeman unit vectors FVN and thus, allows any line formed between a pair of cartesian co-ordinate data point signals Pi and Pi+1 falling within one of the boundaries A1 to A8 to be assigned to the proper Freeman unit vector FVN associated with that boundary.
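The assignment of a line to a Freeman unit vector with a tolerance can be sketched as follows. Figure 7 is not reproduced here, so several points are assumptions made for illustration only: FV1 is taken to point along the positive X axis, the numbering is taken to run counter-clockwise in steps of 45°, and the 20° tolerance is interpreted as ±20° about each unit vector, which leaves narrow gaps between the valid boundaries. The function name `freeman_code` is hypothetical.

```python
import math
from typing import Optional, Tuple

Point = Tuple[float, float]


def freeman_code(p: Point, q: Point, tolerance_deg: float = 20.0) -> Optional[int]:
    """Return N (1 to 8) if the line p->q lies within the valid boundary
    of unit vector FVN, or None if it lies in a gap between boundaries.

    Assumed convention: FV1 = 0 degrees (+X), numbered counter-clockwise.
    """
    dx, dy = q[0] - p[0], q[1] - p[1]
    if dx == 0 and dy == 0:
        return None                                   # no direction at all
    angle = math.degrees(math.atan2(dy, dx)) % 360.0  # 0 <= angle < 360
    nearest = round(angle / 45.0) % 8                 # index of closest unit vector
    centre = nearest * 45.0
    diff = abs(angle - centre)
    diff = min(diff, 360.0 - diff)                    # handle wrap-around at 0/360
    return nearest + 1 if diff <= tolerance_deg else None
```

Under these assumptions a line at 5° from the X axis is assigned to FV1, while a line at 22.5° falls between the boundaries of FV1 and FV2 and is reported as invalid.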
To generate the Freeman unit vector FVN for each line formed between each adjacent cartesian co-ordinate data point signals for each of the primitives, the pre-processed cartesian co-ordinate data point signals are conveyed to the comparator 26a. In the comparator 26a, adjacent cartesian co-ordinate data point signals are examined and a line is formed therebetween. To reduce the errors introduced in the sampled cartesian co-ordinate data due to inadvertent movement of the stylus 20a by the operator, the length of the line formed between each adjacent data point signal is compared with a threshold value. If the length is less than that of the predetermined threshold length, the second data point signal is assumed to be the result of a spurious hand movement by the operator and is thus deleted. This process ensures that a horizontal line drawn by an operator with a slight undesired non-horizontal portion will be filtered to produce data representing the desired horizontal line.
After the removal of inadvertent data point signals, lines are formed between the remaining adjacent data point signals and compared with the modified Freeman code FC. If the line falls within one of the tolerance boundaries A1 to A8, the Freeman unit vector FV1 to FV8 associated therewith is conveyed to the memory 26c. If the line formed between two cartesian co-ordinate data point signals falls within one of the invalid boundaries X1 to X8 in the Freeman code FC, the second cartesian co-ordinate data point signal is replaced by the next succeeding cartesian co-ordinate data point signal and a new line is formed therebetween. Similarly, the new line is compared with the Freeman code FC once again to detect if the line lies within one of the valid boundaries A1 to A8. If the resultant line falls within a valid boundary AN, the Freeman unit vector FVN associated with the boundary AN is conveyed to the memory 26c. However, if a valid Freeman unit vector is not detected, the second data point signal of the pair is replaced by the next succeeding data point and the same process is repeated. If a line falling in a valid boundary is still not detected after substituting each of the remaining cartesian co-ordinate data points generated for the entered primitive, the cartesian co-ordinates are represented by an invalid Freeman unit vector U' and the invalid Freeman vector is conveyed to the memory 26c.
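The substitution of data points when a line falls within an invalid boundary can be sketched as follows. The sketch takes the direction test as a parameter, so any classifier that returns a unit vector number or `None` can be supplied. Two points are assumptions: after a valid line is found, processing is taken to resume from the substituted end point, and after an invalid vector U' is recorded, processing is taken to resume from the following point. The names `to_unit_vectors` and `axis_only` are hypothetical, and `axis_only` is a deliberately crude stand-in for the Freeman code FC.

```python
from typing import Callable, List, Optional, Tuple, Union

Point = Tuple[float, float]
Classifier = Callable[[Point, Point], Optional[int]]


def to_unit_vectors(points: List[Point], classify: Classifier,
                    invalid: str = "U'") -> List[Union[int, str]]:
    """Convert a primitive's points to unit vectors, skipping ahead when a
    line falls in an invalid boundary (classify returns None)."""
    codes: List[Union[int, str]] = []
    i = 0
    while i + 1 < len(points):
        for j in range(i + 1, len(points)):
            code = classify(points[i], points[j])
            if code is not None:
                codes.append(code)
                i = j               # assumed: continue from the accepted end point
                break
        else:
            codes.append(invalid)   # no remaining point gave a valid line
            i += 1                  # assumed: move on to the next start point
    return codes


def axis_only(p: Point, q: Point) -> Optional[int]:
    """Toy classifier: accepts only exactly horizontal or vertical lines."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    if dy == 0 and dx > 0:
        return 1
    if dx == 0 and dy > 0:
        return 3
    if dy == 0 and dx < 0:
        return 5
    if dx == 0 and dy < 0:
        return 7
    return None
```

With the toy classifier, the point (2, 1) in the sequence (0, 0), (1, 0), (2, 1), (3, 0), (3, 2) is skipped because the line from (1, 0) to (2, 1) is rejected, whereas the line from (1, 0) to (3, 0) is accepted.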
Thus, a series of Freeman unit vectors FV1 to FVN or U' are formed for each of the entered primitives and are stored separately in the memory 26c. The series of unit vectors are then separately conveyed to the comparator 26d. The comparator 26d compares each unit vector FVi+1 with the previous unit vector FVi and if they are equivalent, a scalar count is incremented for that unit vector and the unit vector FVi+1 is deleted. This process is performed on the unit vectors generated for each of the entered primitives Pr. This operation results in the formation of a reduced series of unit vectors or a vector code for each entered primitive forming the character, each vector code of which has an associated series of scalars, which represent the length of each of the unit vectors in the vector code.
For example, if the ideographic character IC illustrated in Figures 1 and 4 is entered into the apparatus 10, the comparator 26a firstly examines the cartesian co-ordinate data points associated with the first primitive Pr1 and forms the lines L1_1 to L1_4 between each adjacent data point P1_1 to P1_5 respectively. The lines L1_1 to L1_4 are then compared with the Freeman code FC and the associated Freeman vectors FV1 to FVN are assigned to the lines. Thus, the primitive Pr1 formed from cartesian co-ordinate data points P1_1 to P1_5 and generating lines L1_1 to L1_4 as illustrated in Figure 4 is assigned the Freeman vectors FV3, FV3, FV3, FV3 since each of the lines L1_1 to L1_4 falls within the boundary A3 (assuming that the length of each of the lines is above the threshold value).
With each of the vectors generated for the primitive Pr1, the series of vectors is conveyed to the memory 26c and stored therein. The above described process is then performed on the cartesian co-ordinate data points associated with the primitives Pr2 and Pr3 and resultant vectors formed therefor are also conveyed to the memory 26c. Following this, the Freeman vectors for each primitive Pr are conveyed to the comparator 26d. Thereafter, adjacent Freeman vectors generated for each primitive are compared. If adjacent vectors are identical, one of the vectors is deleted and the scalar count therefor is incremented. The results from the comparator 26d are then conveyed to the memory 26e.
For example, when the primitive Pr1 shown in Figure 4a is processed to form the series of Freeman vectors FV3, FV3, FV3, FV3, the comparator 26d reduces the series of vectors to the vector code FV3 having a scalar of 4. If, for example, a primitive was entered and a series of Freeman vectors equal to FV3, FV3, FV3, FV4, FV4, FV4, FV5, FV5, FV3 was generated therefor, the series of unit vectors would be reduced to the vector code FV3, FV4, FV5, FV3, and a series of scalars equal to 3, 3, 2, 1 would be generated.
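The reduction performed by the comparator 26d amounts to collapsing each run of identical unit vectors into a single entry with a count. A minimal sketch follows, in which the function name `reduce_vectors` and the use of plain integers for the unit vector numbers are assumptions made for illustration.

```python
from typing import List, Tuple


def reduce_vectors(unit_vectors: List[int]) -> Tuple[List[int], List[int]]:
    """Collapse runs of equal unit vectors into a vector code and scalars."""
    vector_code: List[int] = []
    scalars: List[int] = []
    for v in unit_vectors:
        if vector_code and vector_code[-1] == v:
            scalars[-1] += 1          # same as the previous vector: count it
        else:
            vector_code.append(v)     # new direction: start a new run
            scalars.append(1)
    return vector_code, scalars


# The two examples given in the text:
print(reduce_vectors([3, 3, 3, 3]))                  # ([3], [4])
print(reduce_vectors([3, 3, 3, 4, 4, 4, 5, 5, 3]))   # ([3, 4, 5, 3], [3, 3, 2, 1])
```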
From the memory 26e, the vector code and associated series of scalars for each primitive forming the entered character are conveyed to the primitive detection section 28. The vector codes are applied to the comparator 28a and the series of scalars are stored in the memory 28c. The vector codes received by the comparator 28a are compared with the vector codes stored in the primitive dictionary 28b. The dictionary 28b is partitioned into sixteen primitive code sections, the first fifteen sections of which are uniquely associated with one of the fifteen primitives Pra to Pro and store vector codes uniquely associated with that primitive. The sixteenth section holds ambiguous vector codes which can represent more than one of the primitives. The sixteenth section also holds unique test information for each ambiguous vector code to allow the correct entered primitives to be determined.
If the vector code for an entered primitive is equivalent to a vector code found in one of the first fifteen sections of the dictionary 28b, the primitive code a to o associated therewith is conveyed to the memory 30. This process is performed for each of the vector codes generated for each primitive forming the entered character. Thus, a series of primitive codes or a character code is generated, the character code of which represents the ideographic character entered on the digitizer tablet 20.
However, when a vector code generated for one of the primitives is compared with the contents of the dictionary 28b and it is equivalent to a vector code stored in the sixteenth section, the test information associated with the ambiguous vector code is applied to the test section 28d. The test section 28d receives the test information and examines it to determine which vector code is being examined. Thereafter, the test section 28d receives the series of scalars associated with the examined vector code and performs operations thereon, the operations of which are determined by the unique test information. The results of the tests are conveyed back to the dictionary 28b which in turn selects the correct primitive code that represents the entered primitive. The series of scalars provides suitable information to discriminate between the primitives sharing an ambiguous vector code since, although the vector codes are identical, the values of the scalars in each series are typically very different.
For example, if the primitive Pra illustrated in Figure 8a was entered on the tablet 20, a vector code equivalent to FV1, FV2, FV1 would be generated. However, the vector code would be detected in the sixteenth section of the dictionary 28b since this vector code is also used to represent the primitive Prb illustrated in Figure 8b. Although the vector codes for the two primitives are identical, the series of scalars associated therewith are very different. As can be seen, the series of scalars associated with the primitive Pra would be 3, 1, 3 whilst the series of scalars associated with primitive Prb would be 1, 5, 1. Thus, by comparing the relative lengths between the first two scalars in the series, the correct primitive code can be determined.
If the vector code being compared with the contents of the dictionary 28b is not equivalent to a vector code located therein, the vector code is assigned an unidentified primitive code U which is similarly applied to the memory 30. Thus, the output of the primitive detection section 28 comprises a series of primitive codes or a character code, which represents the inputted ideographic character IC.
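The operation of the primitive detection section 28 can be sketched as a look-up with three outcomes: a unique match, an ambiguous match resolved by a test on the scalars, or the unidentified primitive code U. The dictionary contents below are illustrative assumptions and not the contents of the dictionary 28b; only the ambiguous code FV1, FV2, FV1 and its two series of scalars are taken from the example above, and the entry for the vector code FV3 is inferred from the worked example in which the primitive Pr1 yields the primitive code a.

```python
from typing import Callable, Dict, List, Tuple

# Sections one to fifteen: vector codes unique to one primitive (illustrative entry).
UNIQUE: Dict[Tuple[int, ...], str] = {
    (3,): "a",
}

# Sixteenth section: ambiguous vector codes, each with its own test on the scalars.
AMBIGUOUS: Dict[Tuple[int, ...], Callable[[List[int]], str]] = {
    # Compare the first two scalars: 3, 1, 3 gives "a" and 1, 5, 1 gives "b".
    (1, 2, 1): lambda scalars: "a" if scalars[0] > scalars[1] else "b",
}


def detect_primitive(vector_code: List[int], scalars: List[int]) -> str:
    """Return a primitive code, or "U" when the vector code is not found."""
    key = tuple(vector_code)
    if key in UNIQUE:
        return UNIQUE[key]
    if key in AMBIGUOUS:
        return AMBIGUOUS[key](scalars)   # role of the test section 28d
    return "U"                           # unidentified primitive code


def character_code(primitives: List[Tuple[List[int], List[int]]]) -> str:
    """Concatenate the primitive codes of all entered primitives."""
    return "".join(detect_primitive(code, scalars) for code, scalars in primitives)
```

For instance, `character_code([([3], [4]), ([1, 2, 1], [1, 5, 1]), ([3], [4])])` yields "aba" with these illustrative entries.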
The character code stored in the memory 30 is applied to the character detection section 32 and received by the comparator 32a. The comparator 32a compares the character code generated for the entered character with the contents of the character dictionary 32b. As mentioned previously, the dictionary 32b stores a character code for each of the possible ideographic characters in the language along with its corresponding international ASCII output code. The international ASCII output code is used internationally to represent the ideographic character. Since a number of ideographic characters are formed from the same primitives entered in the same order, some ideographic characters have identical character codes although the relative positions between the entered primitives are very different. To allow the apparatus 10 to detect the proper ideographic character when an ambiguous character code is received, the character dictionary 32b also contains test information uniquely associated with each ambiguous character code.
When a character code is received from the memory 30, it is compared with the contents of the dictionary 32b via comparator 32a. If the received character code is equivalent to a character code found in the dictionary 32b that is uniquely associated with only one ideographic character, the international ASCII output code associated therewith is outputted from the dictionary 32b and stored in the memory 34. However, when the character code generated for the entered ideographic character is equivalent to an ambiguous character code that is associated with more than one ideographic character, the unique test information associated therewith is applied to the character differentiator 32c.
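The three possible outcomes of the character look-up can be sketched as follows. The dictionary layout, the placeholder output codes such as "OUT_1" and the toy test function are assumptions for illustration; the actual output codes and test information held in the dictionary 32b are not given in the text.

```python
from typing import Callable, Dict, List, Optional, Tuple, Union

Stroke = List[Tuple[float, float]]
# An entry is either an output code (unique character code) or a test
# function (ambiguous character code) that inspects the raw co-ordinates.
Entry = Union[str, Callable[[List[Stroke]], str]]


def lookup_character(char_code: str, dictionary: Dict[str, Entry],
                     raw_strokes: List[Stroke]) -> Optional[str]:
    """Return an output code, or None if the character code is not found."""
    entry = dictionary.get(char_code)
    if entry is None:
        return None                  # would be passed to the substitution section 36
    if callable(entry):
        return entry(raw_strokes)    # role of the differentiator 32c
    return entry


def toy_test(strokes: List[Stroke]) -> str:
    """Placeholder test: compares how many raw points two primitives have."""
    return "OUT_2" if len(strokes[0]) < len(strokes[-1]) else "OUT_3"


toy_dictionary: Dict[str, Entry] = {"ab": "OUT_1", "aba": toy_test}
```

Storing the test alongside the ambiguous character code mirrors the statement that the test information is uniquely associated with each ambiguous code.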
Upon reception of the test information, the differentiator 32c retrieves the unprocessed cartesian co-ordinate data from the memory 22 and performs operations thereon as determined by the test information in order to determine the international ASCII output code that represents the inputted ideographic character. When performing the test operations, the unprocessed cartesian co-ordinate data points are used as opposed to the series of scalars formed therefor, since the unprocessed cartesian co-ordinate data contains information regarding the relative position of each of the entered primitives. When the correct international ASCII output code has been determined, it is similarly conveyed to the memory 34.
For example, if the ideographic character illustrated in Figure 1 was entered into the apparatus, a character code equal to "aba" would be generated and compared with the contents of the dictionary 32b. However, the character code would be detected as being ambiguous since the ideographic characters IC2 and IC3 shown in Figures 9a and 9b respectively are also represented by the same character code "aba". The unique test information associated with the character code "aba" would be applied to the differentiator 32c, along with the unprocessed cartesian co-ordinate data from the memory 22. For this example, the test information would cause the differentiator 32c to examine the position of the second primitive Pr2 with respect to the first primitive Pr1 to determine if the second primitive Pr2 passes through the first primitive Pr1. If the result of this test was negative, the differentiator 32c would acknowledge that the entered ideographic character IC is not equivalent to ideographic character IC2 since this feature is not present in the character IC2. To distinguish between the ideographic characters IC and IC3, the third primitive Pr3 is compared with the first primitive Pr1 forming the entered ideographic character IC and the relative sizes therebetween are examined. The result of this test enables the differentiator 32c to select the correct international ASCII output code since the primitive Pr1 is smaller than the primitive Pr3. The dictionary 32b receives the results generated by the differentiator 32c and the correct international ASCII output code is conveyed to the memory 34.
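The two kinds of test named in this example, whether one primitive passes through another and which of two primitives is larger, can be sketched on raw co-ordinate data as follows. The text does not state how either test is computed, so the segment-crossing check and the use of path length as the measure of size are assumptions, as are the function names.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Stroke = List[Point]


def _orient(a: Point, b: Point, c: Point) -> float:
    """Twice the signed area of triangle abc (sign gives the turn direction)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segments_cross(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """True if the segments properly cross; merely touching does not count."""
    d1, d2 = _orient(q1, q2, p1), _orient(q1, q2, p2)
    d3, d4 = _orient(p1, p2, q1), _orient(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0


def passes_through(stroke: Stroke, other: Stroke) -> bool:
    """True if any segment of `stroke` crosses any segment of `other`."""
    return any(
        segments_cross(stroke[i], stroke[i + 1], other[j], other[j + 1])
        for i in range(len(stroke) - 1)
        for j in range(len(other) - 1)
    )


def path_length(stroke: Stroke) -> float:
    """Total length of the polyline through the sampled points."""
    return sum(math.dist(stroke[i], stroke[i + 1]) for i in range(len(stroke) - 1))
```

A test function for the character code "aba" could combine `passes_through` on the first two primitives with a comparison of `path_length` for the first and third primitives.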
After the international ASCII output code has been determined and stored in the memory 34, it can be applied to output devices such as a printer 16a, a VDT terminal 16b or an audio synthesizer 16c in order to produce an image of the inputted ideographic character.
However, if the character code is formed from a series of primitive codes wherein one or more of the primitives have been assigned unidentified primitive codes U or if the character code is not equivalent to any of the character codes found in the character dictionary 32b, the character code is applied to the substitution and correction section 36. The substitution and correction section 36 includes the probability matrix 36a, which is in the form of a sixteen row by fifteen column array of registers 36a'. As shown in Figure 10, each row of the matrix is associated with one of the sixteen possible primitive codes, namely a to o and the unidentified primitive code U, and each of the columns of the matrix is associated with one of the fifteen possible primitive codes a to o. Each of the registers 36a' holds a number representing the probability that the primitive code of the row could be mistaken for the primitive code of the column.
Thus, the probability values stored in the registers along the left to right diagonal of the matrix 36a all have values of 1 since the probability that a primitive code will be detected as itself is high. It is highly improbable that two very dissimilar primitives will be mistaken for one another and thus, the probability value stored in a register associated with two dissimilar primitives is typically zero. For example, looking at the first row of the matrix 36a which is associated with the primitive Pra, the probability that the primitive Pra could actually be mistaken for primitive Prc is 0.0 since these primitives are very different in the manner in which they are formed. Primitives which have some similarities to other primitives are assigned probability values ranging from 0.1 to 0.9, depending on the number of similarities therebetween.
When a character code is received in the comparator 36b having at least one unidentified primitive code U therein, the probabilities in the row associated with the primitive code U are examined. When the highest probability value in the row is detected, the primitive code of the column is used to replace the unidentified primitive code U. The resultant character code is conveyed back to the comparator 32a and is compared with the contents of the character dictionary 32b to detect if the resultant character code is equivalent to a character code found therein. If the resultant character code is equivalent to a character code in the dictionary, the international ASCII output code is retrieved from the dictionary 32b and conveyed to the memory 34 wherein it is stored. If the resultant input character code is equivalent to an ambiguous character code, tests are performed on the cartesian co-ordinate data stored in the memory 22 in the same manner as previously described to determine the correct international ASCII output code.
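The first substitution can be sketched as a search of the row of the probability matrix associated with the unidentified primitive code U. The matrix fragment below holds invented values over three columns only; the real matrix 36a has fifteen columns and its values are not given in the text. The function name `substitute_unidentified` is hypothetical.

```python
from typing import Dict


def substitute_unidentified(char_code: str,
                            matrix: Dict[str, Dict[str, float]],
                            unknown: str = "U") -> str:
    """Replace every unidentified primitive code with the column code that
    has the highest probability in the row associated with `unknown`."""
    row = matrix[unknown]
    best = max(row, key=row.get)     # column holding the highest probability
    return "".join(best if c == unknown else c for c in char_code)


# Invented fragment of the matrix: the row for U over three columns only.
toy_matrix = {"U": {"a": 0.2, "b": 0.7, "c": 0.1}}
print(substitute_unidentified("aUa", toy_matrix))   # aba
```

The resultant character code would then be looked up again in the character dictionary, as the paragraph above describes.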
However, if the resultant character code is not equivalent to a character code found in the dictionary 32b or if the originally entered character code does not correspond with a character code found in the dictionary 32b, a second substitution is performed.
When one of the above cases occurs, the character code is conveyed to the comparator 36b and examined to identify the number of primitive codes forming the character code. Following this, each character code in the character dictionary 32b formed from the same number of primitive codes is conveyed to the comparator 36b and compared with the unidentified character code. During this comparison, the number of differences between the primitive codes forming each of the character codes and the primitive codes forming the unidentified character code are examined. If the number of differences detected between the character code and the unidentified character code is greater than a threshold value, the character code is discarded.
However, every character code having a smaller number of differences than the threshold value is noted and the international ASCII output code associated therewith is stored in the memory 36c. The order of the international output codes stored in the memory 36c is chosen so that the first international ASCII output code in the memory is associated with the character code most similar to the unidentified character code. The international output codes stored in the memory 36c are then retrieved from the memory 36c and conveyed to the
VDT terminal, thereby displaying to the user each of the ideographic characters that are most likely to be equivalent to the entered ideographic character. The user may then choose the ideographic character corresponding to the ideographic character that was entered into the apparatus 10 via suitable editing software. If the substitution section 36 does not produce the desired ideographic character, editing programs can be used to retrieve the correct international ASCII output code from the dictionary 32b. The ideographic character signals stored in the memory 34 can be coupled to the printer 16a to allow a reproduction of the inputted ideographic character to be generated. Furthermore, the character signals can be conveyed to the VDT screen 16b to allow the user to view the characters that have been entered into the apparatus 10. The apparatus 10 is also capable of functioning with known editing programs to allow the user to change the ideographic character signals stored in the memory 34.
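The second substitution described above amounts to counting position-by-position differences between character codes of equal length and ranking the survivors. The following is a minimal sketch under stated assumptions: the function names, the placeholder output codes and the treatment of a difference count exactly equal to the threshold (discarded here) are choices made for illustration and are not specified in the description.

```python
def count_differences(code_a, code_b):
    """Number of positions at which two equal-length character codes differ."""
    return sum(1 for x, y in zip(code_a, code_b) if x != y)


def candidate_output_codes(unidentified_code, dictionary, threshold):
    """Output codes of same-length entries with fewer than `threshold`
    differences, ordered so that the most similar entry comes first."""
    scored = []
    for stored_code, output_code in dictionary.items():
        # Only entries formed from the same number of primitive codes are compared.
        if len(stored_code) != len(unidentified_code):
            continue
        differences = count_differences(stored_code, unidentified_code)
        if differences < threshold:
            scored.append((differences, stored_code, output_code))
    scored.sort()
    return [output_code for _, _, output_code in scored]
```

The returned list plays the role of the contents of the memory 36c: its first element is the output code whose character code is most similar to the unidentified character code.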
When the apparatus 10 is conditioned in one of the other modes so that the apparatus functions to recognize characters of a different language, the same set of primitives shown in Figure 3 is used to form the characters. It should be apparent that the primitives shown in Figure 3 are particularly useful in forming ideographic and upper case English language characters since all of the characters in these languages can be formed from these primitives. It should be appreciated that other primitives may have to be added so that all of the characters in all languages can be formed. However, this will be rare since the twenty primitives should be capable of forming substantially all of the characters in every language.
As mentioned previously, the dictionaries in the processor 14 are partitioned with each partition holding the various primitive codes, character codes and ASCII output codes for each upper case character in the other languages. The upper case characters are stored in the apparatus since these characters are typically written in the same manner and order by everyone versed in the language. The various sections in the processor also include test information to allow different characters which generate the same character code to be recognized.
For languages which use strokes similar to primitives Prp to Prt when forming the characters therein, the primitive detection and primitive code determination is performed in the same manner previously described using the Freeman coding except when one of the primitives Prp to Prt is entered on the tablet 20. Accordingly, when a primitive is entered on the tablet 20, the feature extraction section 26 examines the tangents of the lines formed between the sampled points along the primitive to determine the degree of curvature of the primitive (i.e. 180°, 270°, 360°) prior to using the Freeman coding.
If the primitive is detected as having a curvature of substantially 270° or 360°, the primitive code s or t associated with the entered primitive Prs or Prt is immediately determined without further processing. If the curvature of the primitive is detected as being approximately 180°, the starting and ending co-ordinate data signals of the primitive are examined along with the direction of the tangents (i.e. clockwise or counter-clockwise). This allows the primitives Pr to Prr to be differentiated without requiring further processing. Otherwise, if the entered primitive is not detected as having a substantially constant gradient when examining the tangents, the pre-processed co-ordinate data signals are processed using the Freeman coding to determine the correct primitive code.
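One way to picture the curvature test described above is to sum the signed turns between successive line segments of the sampled stroke and compare the total with the nominal values of 180°, 270° and 360°. The sketch below is illustrative only: the function names, the tolerance of 30° and the assumption of a y-up co-ordinate frame are introduced here and are not taken from the description.

```python
import math


def total_turning_degrees(points):
    """Signed sum of the turns between successive segments of the stroke.
    Positive totals are counter-clockwise in a y-up frame (assumed here;
    a tablet with a y-down frame would reverse the sign)."""
    headings = [
        math.atan2(y1 - y0, x1 - x0)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]
    total = 0.0
    for h0, h1 in zip(headings, headings[1:]):
        # Wrap each turn so that crossing the +/-180 degree seam is harmless.
        total += math.atan2(math.sin(h1 - h0), math.cos(h1 - h0))
    return math.degrees(total)


def classify_curvature(points, tolerance=30.0):
    """Return (nominal_curvature, direction), or None when the stroke does not
    match a nominal arc and would fall through to the Freeman coding."""
    turning = total_turning_degrees(points)
    for nominal in (180, 270, 360):
        if abs(abs(turning) - nominal) <= tolerance:
            direction = "counter-clockwise" if turning > 0 else "clockwise"
            return nominal, direction
    return None
```

For a stroke classified as approximately 180°, the returned direction together with the first and last sampled points would supply the information that the description uses to tell the half-circle primitives apart.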
For example, referring to Figure 11, if the apparatus is conditioned to recognize English language characters and the character "M" is entered on the tablet 20, the primitives Prb, Prg, Prc and Prb are used to form the character. These primitives are processed by the feature extraction section 26 and the primitive detection section in the same manner previously described. Accordingly, a character code equal to "bgcb" would be generated. The associated ASCII output code would be outputted since this code is only associated with the character "M" in the English language.
If, for example, the English characters "D" and "P" were entered on the tablet 20 as shown in Figure 12, the character code generated for each character would be "bq" since the primitives forming both characters are Prb and Prq. Thus, if one of these characters is entered, test information stored in the character dictionary is used in a similar manner to that previously described and the length of the primitive Prb and the length between the starting and ending points of primitive Prq are examined. This allows the two characters to be differentiated even though the character codes generated for the two characters are the same.
With respect to other languages such as German, French etc., the method of detecting the handwritten characters is the same although the apparatus must be conditioned to the appropriate mode via means 18. This is necessary even for languages like German, French and English wherein the characters forming the language are the same, since the ASCII output codes therefor are different. The substitution matrix can also be used for each of the other languages although it is not necessary due to the small number of characters used in non-ideographic languages.
Also, when the apparatus 10 is conditioned to detect upper case characters of a language, the device is also provided with software for outputting the ASCII code for the lower case equivalent of the detected upper case character if desired. Although the lower case letters can be detected in a similar manner to the upper case letters, lower case letters are typically written differently by individuals, thereby making the detection process more difficult and requiring more memory space to permit detection of the character in the many possible ways that it can be written.
The present apparatus has been employed in an IBM PC XT personal computer manufactured by International Business Machines provided with a 20 Mb hard disk which functions to store the information for the dictionaries. To perform the identification processes described hereinabove, the computer is supplied with the appropriate software which allows the input cartesian co-ordinate data point signals to be processed in the above-mentioned manner. Since a large amount of data is stored in the dictionary 32b, i.e. character codes and associated international output codes for approximately 50,000 different ideographic characters, a B-tree algorithm which is well known in the art is used to increase the speed of the comparison between the character code generated for the inputted ideographic character and the character codes stored therein. Although the B-tree algorithm increases processing speed, it also increases memory requirements, since indexing files are required.
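The benefit of an ordered index such as the one mentioned above can be illustrated with the following sketch. It does not implement a B-tree: a binary search over sorted keys held in memory is swapped in here as a stand-in, and the class name, method name and sample entries are assumptions made for illustration. What it shares with a B-tree is that a lookup inspects a number of keys that grows logarithmically with the size of the dictionary, at the cost of keeping a separate ordered key structure.

```python
import bisect


class OrderedCodeIndex:
    """Sorted-key index standing in for the B-tree; not a B-tree implementation."""

    def __init__(self, entries):
        pairs = sorted(entries.items())
        # The separate key list is the extra storage that the index requires.
        self._keys = [code for code, _ in pairs]
        self._values = [output for _, output in pairs]

    def find(self, character_code):
        """Binary-search the sorted keys; return the output code or None."""
        position = bisect.bisect_left(self._keys, character_code)
        if position < len(self._keys) and self._keys[position] == character_code:
            return self._values[position]
        return None
```

For a dictionary of approximately 50,000 character codes, a search of this kind examines on the order of sixteen keys rather than scanning every entry.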
The present apparatus 10 can also be manufactured on a small integrated circuit board capable of being coupled to a conventional personal computer, the board of which is provided with ROM components to store the various dictionary contents and a microprocessor including appropriate software to perform the data processing functions.
Thus, the present apparatus provides the advantage of being able to distinguish between characters which are formed from the same primitives entered in the same order. This decreases the occurrences of an operator having to halt data entry operations in order to choose the correct ideographic character. Moreover, the substitution means further decreases the above-mentioned occurrences since it allows the present apparatus to choose a different character code that is most similar to the entered character code if the input character is not found in the apparatus 10. Furthermore, since the apparatus can be implemented using software or manufactured using hardware components, the apparatus is versatile and can be used in various environments.
The present device also provides further advantages in that the manner in which the entered strokes are processed in the apparatus allows the strokes to be written substantially anywhere on the tablet surface except for the small number of characters which generate an ambiguous character code. Also, the processing used prior to the determination of the primitives forming the character allows the entered characters to be determined irrespective of the length of the entered primitives, with a few exceptions.
Furthermore, the simple approach and processing allows handwritten characters in substantially all languages to be recognized quickly, thereby allowing the device to be used in real-time applications.
It should be apparent to one skilled in the art that the present device can be modified to detect any inputted character provided the appropriate information regarding the character to be detected is stored in the dictionaries located therein.

Claims

We Claim:
1. A character recognition apparatus for identifying characters formed from a number of primitives, said characters and primitives being members of predetermined sets, said apparatus comprising: input means for receiving successively each of the primitives forming said character and generating input signals for each of said received primitives; processing means receiving said input signals and identifying each of said primitives received by said input means, said processing means generating a character code representing said character upon identification of said primitives; storage means storing a character code and an associated output code for each of the characters in said set; comparing means comparing said character code generated for said entered character with each of said character codes in said storage means to identify said entered character; and output means in communication with said comparison means and generating a reproduction of said entered character upon the identification thereof by said comparison means.
2. The character recognition apparatus as defined in Claim 1 further comprising: differentiation means examining said input signals generated for each of said primitives and performing operations thereon when said character code is equivalent to a character code associated with a plurality of output codes to identify the output code associated with said character.
3. The character recognition apparatus as defined in Claim 2 wherein said primitives are capable of forming every character in a plurality of languages, said storage means storing a character code and an output code for each of said characters in all of said languages.
4. The character recognition apparatus as defined in Claim 3 wherein said storage means further stores character test information, said test information being provided for each character code associated with more than one output code, said differentiating means including a processor for receiving said character test information and said input signals and performing operations thereon in accordance with said character test information to detect the output code corresponding to said character.
5. The apparatus as defined in Claim 2 wherein said predetermined set of primitives comprises twenty distinct primitives, the various combination of said twenty primitives being capable of forming any characters in said languages.
6. The apparatus as defined in Claim 3 wherein said processing means generates a primitive code for each of said identified primitives, said apparatus further comprising substitution means, said substitution means receiving said character code when it is not equivalent to any of said character codes in said storage means, said substitution means including comparator means for comparing each primitive code forming said character code with the corresponding primitive code of said character codes in said storage means having the same number of primitive codes as input character code generated for said received character; and a memory for storing the output code associated with each of the character codes in said storage means having fewer than a predetermined number of differences when compared with said generated character code.
7. The apparatus as defined in Claim 6 wherein said substitution means further comprises a probability matrix, said matrix generating a substitution primitive code most likely to be the unidentified primitive code when said substitution means receives a character code having at least one unidentified primitive code therein and substituting said substitution primitive code for said unidentified primitive code in an attempt to form a character code equivalent to a character code stored in said storage means, and most likely to represent said character.
8. The apparatus as defined in Claim 1 wherein said input means is an on-line digitizer tablet providing cartesian co-ordinate data for each of said primitives forming said character, said processing means further comprising encoding means for examining said cartesian co-ordinate data for each of said primitives and forming therefrom a series of unit vectors.
9. The apparatus as defined in Claim 8 wherein said encoding means is a modified Freeman encoder which includes a plurality of Freeman unit vectors.
10. The apparatus as defined in Claim 9 wherein said processing means further comprises: feature extraction means for receiving said series of unit vectors for each of said primitives and eliminating redundant unit vectors to form a vector code and an associated series of scalars for each of said primitives; holding means for storing vector codes and an associated primitive code representing each of said primitives in said set along with an unidentified primitive code; and comparator means for comparing said vector codes generated for said character with said vector codes stored in said holding means, said comparator means outputting said primitive code when said vector code is equivalent to a vector code stored in said holding means and outputting said unidentified primitive code when said vector code is not equivalent to a vector code stored in said holding means.
11. The apparatus as defined in Claim 10 wherein said holding means is further provided with primitive test information, said information being uniquely associated with vector codes which represent more than one primitive, said processing means further comprising a test section receiving said primitive test information and said series of scalars associated with said vector code and performing operations thereon to detect the correct primitive code associated with said vector code when said vector code is equivalent to a vector code representing more than one primitive code.
12. The apparatus as defined in Claim 11 wherein said output means is selected from the group comprising: a printer, an audio-synthesizer and a video display terminal.
13. An apparatus as defined in Claim 8 further comprising a pre-processing means for receiving said cartesian co-ordinate data, said pre-processing means comparing the distance between first and adjacent second co-ordinates and removing said second co-ordinate if said distance is less than a predetermined threshold value thereby reducing the amount of redundant data.
14. A method of identifying a character formed from a number of primitives, said character and said primitives being members of predetermined sets, said method comprising the steps of: receiving successively each of said primitives forming said character and generating input signals for each of said primitives; examining said input signals to identify each of said entered primitives forming said character; generating a primitive code for each of said primitives to form a character code upon identification of said primitives forming said character; storing a character code and an associated output code for each of said characters in said set; comparing said character code with each of said character codes stored to detect said output code when said character code is equivalent to a character code associated with only one output code; and examining said primitive codes generated for said character and performing operations thereon when said character code is equivalent to a character code associated with more than one output code in order to detect the output code associated with said entered character; and generating an image of said character upon detection of said associated output code.
15. A method as defined in Claim 14 wherein character test information is provided, said information being uniquely associated with character codes associated with more than one character signal, said method further comprising the steps of: receiving said character test information and said input signals and performing operations thereon in accordance with said character test information to detect the output code corresponding to said character.
16. A method as defined in Claim 15 further comprising the step of forming said set of primitives from twenty distinct primitives, the various combinations of said twenty primitives being capable of forming any character in a plurality of languages.
17. A method as defined in Claim 16 further comprising the steps of: receiving said character code when it is not equivalent to any of said stored character codes; comparing each of said primitive codes forming said character code with the corresponding primitive code of said stored character codes having the same number of primitive codes therein as said character code generated for said entered character; and storing the output code associated with each of said stored character codes when said stored character code has fewer than a predetermined number of differences than said character code when compared therewith.
18. A method as defined in Claim 17 further comprising the steps of: receiving an input character code having at least one unidentified primitive code therein; and substituting for said unidentified primitive code, the primitive code most likely to be the unidentified primitive code in an attempt to form a character code equivalent to a stored character code.
19. A method as defined in Claim 18 further comprising the steps of: providing a digitizer tablet for generating cartesian co-ordinate data for each of said primitives forming said ideographic characters; and encoding said cartesian co-ordinate data to form a series of unit vectors for each of said primitives.
20. A method as defined in Claim 19 wherein said encoding is performed by a modified Freeman encoder which includes a plurality of Freeman vectors.
21. A method as defined in Claim 20 further comprising the steps of: examining said series of unit vectors for each of said primitives and eliminating redundant unit vectors to form a vector code and an associated series of scalars for each of said primitives; storing vector codes and said associated primitive code representing each of the primitives in said predetermined set; comparing said vector code for each primitive with said stored vector codes; and generating said primitive code when said vector code is equivalent to a stored vector code and generating said unidentified primitive code when said vector code is not equivalent to any of said stored vector codes.
22. A method as defined in Claim 21 further comprising the steps of: providing primitive test information associated with vector codes that represent more than one primitive code; and performing operations in accordance with said test information on said series of scalars associated with said vector code to detect the correct primitive code associated therewith, when said vector code is equivalent to a vector code representing more than one primitive code.
23. A method as defined in Claim 22 further comprising the steps of: determining the distance between first and second adjacent cartesian co-ordinate data points; comparing said distance with a predetermined threshold value; and removing said second adjacent cartesian co-ordinate data point if said distance is less than said predetermined threshold value.
EP89900859A 1987-12-11 1988-12-12 Character recognition apparatus Withdrawn EP0396593A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US13173487A 1987-12-11 1987-12-11
US131734 1987-12-11

Publications (1)

Publication Number Publication Date
EP0396593A1 true EP0396593A1 (en) 1990-11-14

Family

ID=22450781

Family Applications (1)

Application Number Title Priority Date Filing Date
EP89900859A Withdrawn EP0396593A1 (en) 1987-12-11 1988-12-12 Character recognition apparatus

Country Status (6)

Country Link
EP (1) EP0396593A1 (en)
JP (1) JPH03502841A (en)
KR (1) KR900700973A (en)
CN (1) CN1019612B (en)
CA (1) CA1309774C (en)
WO (1) WO1989005494A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6128409A (en) * 1991-11-12 2000-10-03 Texas Instruments Incorporated Systems and methods for handprint recognition acceleration
JP6491438B2 (en) * 2014-08-29 2019-03-27 株式会社日立社会情報サービス Migration support device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS51118333A (en) * 1975-04-11 1976-10-18 Hitachi Ltd Pattern recognition system
US4365235A (en) * 1980-12-31 1982-12-21 International Business Machines Corporation Chinese/Kanji on-line recognition system
JPS5975375A (en) * 1982-10-21 1984-04-28 Sumitomo Electric Ind Ltd Character recognizer
US4561105A (en) * 1983-01-19 1985-12-24 Communication Intelligence Corporation Complex pattern recognition method and system
JPS60217477A (en) * 1984-04-12 1985-10-31 Toshiba Corp Handwritten character recognizing device
EP0195680A3 (en) * 1985-03-21 1987-06-10 Immunex Corporation The synthesis of protein with an identification peptide
JPS621086A (en) * 1985-06-26 1987-01-07 Toshiba Corp Character input device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO8905494A1 *

Also Published As

Publication number Publication date
KR900700973A (en) 1990-08-17
CN1035195A (en) 1989-08-30
JPH03502841A (en) 1991-06-27
WO1989005494A1 (en) 1989-06-15
CA1309774C (en) 1992-11-03
CN1019612B (en) 1992-12-23

Similar Documents

Publication Publication Date Title
US5034989A (en) On-line handwritten character recognition apparatus with non-ambiguity algorithm
KR100297482B1 (en) Method and apparatus for character recognition of hand-written input
JP3560289B2 (en) An integrated dictionary-based handwriting recognition method for likely character strings
US5664027A (en) Methods and apparatus for inferring orientation of lines of text
US6272242B1 (en) Character recognition method and apparatus which groups similar character patterns
US4653107A (en) On-line recognition method and apparatus for a handwritten pattern
US5161245A (en) Pattern recognition system having inter-pattern spacing correction
Xu et al. Prototype extraction and adaptive OCR
JP2005242579A (en) Document processor, document processing method and document processing program
US3839702A (en) Bayesian online numeric discriminant
EP0396593A1 (en) Character recognition apparatus
US6320985B1 (en) Apparatus and method for augmenting data in handwriting recognition system
JPH10177623A (en) Document recognizing device and language processor
EP1010128B1 (en) Method for performing character recognition on a pixel matrix
JP3233803B2 (en) Hard-to-read kanji search device
CN114022886B (en) Handwriting recognition training set generation method, system and medium for tablet
JPH09114926A (en) Method and device for rough classifying input characters for on-line character recognition
JP2953162B2 (en) Character recognition device
KR100258934B1 (en) Apparatus and method for recognizing english word on line by selecting alphabet from the alphabet groups
JPH0830717A (en) Character recognition method and device therefor
JPS60138689A (en) Character recognizing method
JPS6059487A (en) Recognizer of handwritten character
JPH09319826A (en) Hand-written character recognition device
JPH0520503A (en) Character recognizing device
Dworsky et al. IP industry: Nordstrom or K-Mart?

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 19900611

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH DE FR GB IT LI LU NL SE

17Q First examination report despatched

Effective date: 19920716

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 19950221