CN1549192A

CN1549192A - Computer identification and automatic inputting method for hand writing character font

Info

Publication number: CN1549192A
Application number: CNA031190782A
Authority: CN
Inventors: 周非凡; 程卓; 凡东; 曾俊玲; 张惠捷
Original assignee: China University of Geosciences
Current assignee: China University of Geosciences
Priority date: 2003-05-16
Filing date: 2003-05-16
Publication date: 2004-11-24
Anticipated expiration: 2023-05-16
Also published as: CN100485711C

Abstract

A method for computer recognition and automatic input of handwriting comprises the following steps: image preprocessing of the handwriting input from a scanner; extraction of the handwritten characters, including line segmentation using the horizontal projection of the text lines and character segmentation using the vertical projection of the characters; building templates of the computer font and of the handwritten font, including extraction and classification of font feature vectors; character matching, including extraction and matching of the feature vectors of the computer font; and handwriting recognition by establishing the correspondence between handwritten characters and computer characters. The invention is simple and convenient and facilitates human-computer interaction.

Description

Method for computer recognition and automatic input of handwritten script

Technical field

The present invention relates to the field of Chinese information processing technology, and in particular to a method for computer recognition and automatic input of handwritten script.

Background technology

Computer recognition and automatic input of handwritten script is currently one of the most active problems in natural language processing. Its main function is to process arbitrary handwritten manuscripts. The handwriting pads popular on the market and the tablet computer released by Microsoft have, to a certain extent, solved the problem of time-consuming text input and demonstrated the advantages of office automation. However, handwriting pads and tablet computers have significant shortcomings: they are expensive and hard for ordinary users to afford, and they must be carried along when in use. In addition, for handwritten script such as manuscripts written on paper, and for written materials such as handwritten script and fonts printed on a carrier, computers cannot at present perform automatic recognition and automatic input; manual recognition and input are required.

Summary of the invention

The technical problem to be solved by the invention is to provide a method for computer recognition and automatic input of handwritten script which not only enables a computer to recognize automatically a handwritten manuscript input through a scanner, but can also recognize handwritten script and fonts printed on a carrier and input through the scanner, and which converts the image information of the text into a character-code form that the computer can process directly, thereby completing the automatic input of the text.

The technical solution adopted by the invention comprises:

1) A step of image preprocessing of the handwritten script input from a scanner;

2) Extraction of the handwritten characters, comprising line segmentation and character segmentation,

Line segmentation is performed using the horizontal projection of the text lines,

Character segmentation is performed using the vertical projection of the characters;

3) Modeling of the computer font, comprising extraction and classification of font feature vectors;

4) Modeling of the handwritten script, following the same steps as the modeling of the computer font;

5) Character matching, comprising extraction and matching of the feature vectors of the computer font,

The extraction of the feature vectors of the computer font is accomplished by the computer-font modeling step,

The matching of the feature vectors of the computer font comprises matching of single characters, and detection matching and error correction of sentences;

6) Recognition of the handwritten script, with the following steps:

After feature extraction has been carried out on the handwritten script, the features are encoded according to the feature-vector classification method,

After each group of features has been encoded, the index value corresponding to each group is first looked up in the feature database,

Once the corresponding index codes have been found, the corresponding national standard GB code is looked up from the index codes according to the correspondence rules of the mapping table, thereby establishing the correspondence between the handwritten script and the computer font;

Steps 1) to 5) above are the steps of the automatic input method.

The main advantages of the invention are as follows:

First, the computer can automatically recognize a handwritten manuscript input through a scanner, and can likewise automatically recognize handwritten script and fonts printed on a carrier and input through the scanner.

Second, the image information of the text can be converted into a character-code form that the computer can process directly, completing the automatic input of the text.

Third, it is easy to use: the writer need only provide the handwritten manuscript, and the computer can be operated by the writer or by someone else. Handwritten material such as manuscripts, letters, notes and signatures, as well as written materials such as handwritten script and fonts printed on a carrier, are input with a scanner and then recognized and input automatically. This genuinely solves the problem of material that could not previously be input and provides a convenient human-computer interface.

Fourth, no retyping is needed, which saves effort, time and manpower. Used together with a printer, the above written materials can be printed out, which genuinely solves the problem of time-consuming input and can also make a photocopier unnecessary.

Fifth, the application prospects are broad: the method is suitable for offices, publishing houses, newspaper and periodical offices, personal use and so on, and the market potential is large.

Description of drawings

Fig. 1 is the main program flow chart of the invention.

Fig. 2 is a schematic diagram of the horizontal projection used for line segmentation.

Fig. 3 is a schematic diagram of the vertical projection used for character segmentation.

Fig. 4 is a schematic diagram of the top, bottom, left and right projections of the image of a single handwritten character.

Fig. 5 is a schematic diagram of the quantized image, taking the left projection as an example.

Fig. 6 is a schematic diagram of the image after differentiation, taking the left projection as an example.

Detailed description of the embodiments

The invention is further described below with reference to an embodiment and the accompanying drawings.

One. Flow

Comprising:

1) A step of image preprocessing of the handwritten script input from a scanner;

As shown in Fig. 1, further comprising:

6) Recognition of the handwritten script, with the following steps:

After feature extraction has been carried out on the handwritten script, the features are encoded according to the feature-vector classification method.

After each group of features has been encoded, the index value corresponding to each group is first looked up in the feature database.

Once the corresponding index codes have been found, the next step is to look up, according to the correspondence rules of the mapping table, the internal code corresponding to each index value, namely the national standard GB code, thereby establishing the correspondence between the handwritten character and the computer character. During this lookup, however, several handwritten characters may map to the same computer character, or a handwritten character may have no corresponding computer character. Such problems are resolved by a corpus-based statistical language model, which determines the correspondence between the two probabilistically.

Steps 1) to 5) above are the steps of the automatic input method.

Two. Image preprocessing (known technology)

The handwritten page is first captured by the scanner as a picture. The picture is then initialized and quantized into a dot matrix (including color intensity).

Removing regular "noise" such as the ruled grid of the paper: the grid is highly regular and generally differs in color from the writing, so selecting points of that color and removing them achieves the purpose.

Removing stains: a stain appears in the dot matrix as a continuous and generally fairly uniform patch. Using these characteristics, its edge can be found and the patch removed.

Three. Extraction of the handwritten characters

1. Line segmentation:

Lines are separated from one another. Because there are clear gaps between lines, a gap appears in the binarized dot matrix as a region consisting of 0s. Segmentation is performed using the horizontal projection of the text lines. The purpose of line segmentation is to compute, from a document image, the upper and lower bounds of the pixels of each line of text, thereby obtaining the text lines.

Because pen pressure varies between heavy and light during handwriting, using the gray scale better reflects the difference between the gaps and the lines of handwritten text.

The line segmentation method is as follows: a group of horizontal parallel light rays is used for illumination, giving a projection along one coordinate direction. The gray level of this projection is measured by the amount of "luminous flux" that is blocked. The formula is

$$ v_{y} = \sum_{x=0}^{S_x} f_{1}(x, y)\, f(x, y) \tag{1} $$

In the formula, f₁(x, y) is the gray-scale image of the text, f(x, y) is the binary image of the document image, and S_x is the size of the document image along x.

In a handwritten manuscript there is generally a large spacing between lines, but noise must also be taken into account. A very small threshold v1 is therefore set: where the projection value is below the threshold, the position is regarded as a gap between text lines; where it is above v1, it is regarded as a region occupied by the characters themselves. In this way the text lines can be separated accurately.
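The thresholding of formula (1) can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the function name `segment_lines`, the toy image, and the convention that the gray-scale array holds ink intensity (larger value means darker ink) are assumptions made for the example.

```python
import numpy as np

def segment_lines(gray, binary, v1):
    """Return (top, bottom) row indices of each text line.

    gray   : 2-D array f1(x, y), assumed to hold ink intensity
    binary : 2-D 0/1 array f(x, y) of the same shape
    v1     : small threshold separating gaps from text rows
    """
    # Formula (1): v_y is the sum over x of f1(x, y) * f(x, y).
    v = (gray.astype(float) * binary).sum(axis=1)
    is_text = v > v1

    lines, start = [], None
    for y, flag in enumerate(is_text):
        if flag and start is None:
            start = y                      # a text line begins
        elif not flag and start is not None:
            lines.append((start, y - 1))   # the line ended on the previous row
            start = None
    if start is not None:                  # line runs to the bottom edge
        lines.append((start, len(is_text) - 1))
    return lines

# Toy page: two "lines" of ink separated by empty rows.
binary = np.zeros((8, 6), dtype=int)
binary[1:3, 1:5] = 1
binary[5:7, 2:4] = 1
gray = binary * 200
lines = segment_lines(gray, binary, v1=10)
print(lines)
```

On the toy page the rows 1 to 2 and 5 to 6 are reported as two separate text lines; a real page would need v1 tuned to its noise level.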

2. Character segmentation:

Once the text lines have been separated, segmentation between characters can be carried out. Because recognition is based on feature vectors, single handwritten characters must be cut out of each text line. There are spaces between Chinese characters, and these spaces can be used to separate the handwritten characters. In general the spaces between Chinese characters are sufficient for this, but because handwriting usually contains connecting strokes, the width occupied by each character cannot be determined in advance, so the characters cannot be isolated by dividing the line into fixed intervals. A light-projection method is therefore used for the separation: a group of vertical parallel light rays is used for illumination, giving a projection along one coordinate direction. If this "shadow" has a gray level, it is measured by the amount of "luminous flux" that is blocked. The outline of the shadow is a curve, so a planar shape is converted into a planar curve. Because connecting strokes are light, that is, weak in gray level, the gray scale is used in the calculation to bring out the separation better.

$$ v(x) = \sum_{y=1}^{S_y} f_{1}(x, y)\, f(x, y) \tag{2} $$

In the formula, f₁(x, y) is the gray-scale image of the text, f(x, y) is the binary image of the document image, and S_y is the size of the document image along y.

The gray-scale image is used because connecting strokes inevitably appear during writing. They are generally lighter than normal strokes, which shows clearly in the gray-scale image, so the spaces stand out more clearly in v(x). The minimum value min(x) of v(x) is detected and a threshold v2 is set: points where v(x) > v2 are regarded as belonging to a handwritten character region, and points where v(x) < v2 are regarded as the gap between characters.
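A minimal sketch of formula (2) follows, showing why the gray-weighted projection helps with connecting strokes. The function name `segment_characters`, the toy line image and the intensity values are assumptions chosen for illustration; the gray array is again assumed to hold ink intensity.

```python
import numpy as np

def segment_characters(gray, binary, v2):
    """Return (left, right) column indices of each character in one text line."""
    # Formula (2): v(x) is the sum over y of f1(x, y) * f(x, y).
    v = (gray.astype(float) * binary).sum(axis=0)
    cols = np.flatnonzero(v > v2)          # columns regarded as character area
    if cols.size == 0:
        return []
    # A new character starts wherever consecutive text columns are not adjacent.
    breaks = np.flatnonzero(np.diff(cols) > 1)
    starts = np.concatenate(([cols[0]], cols[breaks + 1]))
    ends = np.concatenate((cols[breaks], [cols[-1]]))
    return [(int(s), int(e)) for s, e in zip(starts, ends)]

# Toy line: two characters joined by a faint connecting stroke in columns 3-4.
binary = np.zeros((4, 10), dtype=int)
gray = np.zeros((4, 10), dtype=int)
binary[:, 0:3] = 1
gray[:, 0:3] = 200      # first character, dark ink
binary[2, 3:5] = 1
gray[2, 3:5] = 30       # light connecting stroke
binary[:, 5:9] = 1
gray[:, 5:9] = 200      # second character, dark ink
chars = segment_characters(gray, binary, v2=50)
print(chars)
```

In the binary image alone the connecting stroke bridges the two characters; weighting by gray level keeps its projection value (30) below v2, so the two characters are still separated.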

Formulas (1) and (2) essentially give the position of each handwritten character, that is, they cut out the individual handwritten characters.

Four. Modeling of the computer font

1. Extraction of font feature vectors:

1) Establish the feature vector of the character pattern: first, a standard dot matrix is set up for the image of each single handwritten character obtained after segmentation, that is, one whose horizontal and vertical extents are equal, and a 0/1 dot matrix is built. For example, the segmented image is placed at the geometric center of a 48 × 48 dot matrix in preparation for feature extraction; without this processing, the similarity of characters cannot be compared correctly. The projection of the handwritten character is compared with the standard dot matrix and binarized; this is done in the image preprocessing step.

Then the image of the single handwritten character (for example the character "中") is projected from the top, bottom, left and right, giving the images of four groups of feature vectors (see Fig. 4).

The figure reflects the rising and falling trends of the strokes. The waveforms in the figure are defined as the edge functions H1(X), H2(X), H1(Y), H2(Y) of the character pattern. The edge functions are rich in information; nearly every feature of a handwritten character shows up in them. In real text, because of different fonts and different symbols, characters are neither of equal width nor of equal height even within the same font, and the cut position cannot fall exactly at the junction of two characters. All of these strongly affect the accurate extraction of the above features.
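One plausible reading of the four edge functions is the distance from each side of the dot matrix to the first ink pixel. The sketch below follows that reading; the function name `edge_profiles`, the dictionary keys and the choice to assign the full width or height to empty rows and columns are assumptions, since the text does not define the edge functions numerically.

```python
import numpy as np

def edge_profiles(glyph):
    """Distance from each border to the first ink pixel of a 0/1 dot matrix."""
    h, w = glyph.shape
    has_row = glyph.any(axis=1)
    has_col = glyph.any(axis=0)
    # argmax returns the index of the first 1; empty rows/columns get w or h.
    left = np.where(has_row, glyph.argmax(axis=1), w)
    right = np.where(has_row, glyph[:, ::-1].argmax(axis=1), w)
    top = np.where(has_col, glyph.argmax(axis=0), h)
    bottom = np.where(has_col, glyph[::-1, :].argmax(axis=0), h)
    return {"left": left, "right": right, "top": top, "bottom": bottom}

# Toy 5 x 5 glyph shaped like a plus sign.
glyph = np.array([
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
])
profiles = edge_profiles(glyph)
print(profiles["left"].tolist())
```

For the plus sign each profile dips to 0 at the crossbar and is 2 elsewhere, which is the kind of rise and fall the edge functions are meant to capture.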

2) Establish the edge functions of the character pattern: H1(X), H2(X), H1(Y), H2(Y). The edge functions are rather rough curves, which is unfavorable for extracting feature values. They can be quantized using formula (3). The quantized image is shown in Fig. 5, which takes the left projection as an example.

3) Quantize the edge functions. The formula is

$$ h(x) = \sum_{x_1=0}^{b_1} \frac{H(x_1) + H\left(x_1 + \frac{b_1}{m}\right)}{2} \left[ u(x - x_1) - u\left(x - x_1 - \frac{b_1}{m}\right) \right] \tag{3} $$
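Formula (3) replaces the edge function on each interval of width b₁/m by the average of its two endpoint values, the difference of step functions u selecting the interval. A minimal sketch follows, assuming that the summation index x₁ advances in steps of b₁/m and that b₁ is divisible by m; the function name `quantize_edge` and the sample values are hypothetical.

```python
import numpy as np

def quantize_edge(H, m):
    """Piecewise-constant approximation of an edge function per formula (3).

    H : samples H(0), H(1), ..., H(b1)
    m : number of quantization intervals (assumed to divide b1)
    """
    H = np.asarray(H, dtype=float)
    b1 = len(H) - 1
    if b1 % m != 0:
        raise ValueError("this sketch assumes b1 is divisible by m")
    step = b1 // m
    h = np.empty(b1)
    for k in range(m):
        x1 = k * step
        # On [x1, x1 + step) the value is the mean of the two endpoint samples.
        h[x1:x1 + step] = (H[x1] + H[x1 + step]) / 2
    return h

H = [0, 1, 2, 3, 4, 4, 4, 2, 0]      # rough edge function, b1 = 8
h = quantize_edge(H, m=4)
print(h.tolist())                    # [1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 2.0, 2.0]
```

The result is a staircase whose jumps become the impulse functions after differentiation.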

4) Extract the feature vectors of the character pattern: the quantized edge functions built from the four edge functions H1(X), H2(X), H1(Y), H2(Y) are each differentiated, giving four groups of vectors made up of impulse functions. The differentiated image is shown in Fig. 6, which takes the left projection as an example.

For each group of impulse functions, three groups of feature vectors can be extracted as follows:

Each impulse function represents a direction. Taking the left projection as an example, a positive direction is recorded as 1 and a negative direction as 0; arranged in order, these form a feature vector group S1;

Between every two impulse functions there is an interval; the ratio of all the intervals is recorded, for example a(1) : a(2) : a(3) : ... : a(n);

The amplitudes of the impulse functions may differ; the ratio of the amplitudes of all the impulse functions is recorded, for example b(1), b(2), b(3), ..., b(n);

Proceeding in the same way gives the vectors for the different directions, that is, top, bottom, left and right.
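The three vectors described above can be read off the discrete difference of a quantized edge function. The sketch below is an illustration under assumptions: the function name `impulse_features` is hypothetical, the discrete difference stands in for differentiation, and the intervals and amplitudes are returned as raw values whose mutual ratios are the a(i) and b(i) of the text.

```python
import numpy as np

def impulse_features(h):
    """Sign, interval and amplitude vectors of a quantized edge function."""
    d = np.diff(np.asarray(h, dtype=float))   # nonzero entries are the impulses
    pos = np.flatnonzero(d)                   # positions of the impulses
    signs = [1 if d[p] > 0 else 0 for p in pos]    # 1 = positive, 0 = negative
    intervals = np.diff(pos).tolist()              # spacing between impulses
    amplitudes = np.abs(d[pos]).tolist()           # impulse heights
    return signs, intervals, amplitudes

h = [1, 1, 3, 3, 4, 4, 2, 2]          # a staircase such as formula (3) produces
signs, intervals, amplitudes = impulse_features(h)
print(signs, intervals, amplitudes)   # [1, 1, 0] [2, 2] [2.0, 1.0, 2.0]
```

Applying the same function to the top, bottom, left and right profiles gives the vectors for all four directions.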

For the computer font, vectors in the top, bottom, left and right directions can likewise be set up for each computer character.

2. Classification of font feature vectors:

Because comparing feature values directly requires too much computation, a coding-based approach to building the database is proposed.

1) Coding

The amplitude vector reflects the undulation of the character. It is coded as follows:

Given an amplitude vector b(1), b(2), b(3), ..., b(n), where n is a natural number, such data is inconvenient to manage and retrieve when stored in the computer. Let the code of b(1) be 1; if b(2) > b(1), the code of b(2) is 1, otherwise it is 0. Generalizing, each subsequent element b(i) is coded as 1 if b(i) > b(i-1) and as 0 otherwise.

For example, the amplitude vector 1:4:5:2:3:6 has the code 1:1:1:0:1:1.

The blank vector (the vector of interval ratios) reflects the distribution of the strokes of the character. It is coded in the same way as the amplitude vector.

The symbolic vector (the vector of impulse signs) has already been coded in the preceding step; it is likewise a vector made up of 1s and 0s.
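The amplitude coding rule can be checked against the example in the text with a few lines of code. The function name `encode_ratio_vector` is hypothetical, and the rule is taken to compare each element with the preceding original element, which is the reading consistent with the example 1:4:5:2:3:6.

```python
def encode_ratio_vector(values):
    """Code a ratio vector as 0/1: the first element is 1, each later element
    is 1 if it is larger than its predecessor and 0 otherwise."""
    if not values:
        return []
    code = [1]
    for prev, cur in zip(values, values[1:]):
        code.append(1 if cur > prev else 0)
    return code

code = encode_ratio_vector([1, 4, 5, 2, 3, 6])
print(code)   # [1, 1, 1, 0, 1, 1]
```

The same function would serve for the blank vector, since the text states that it is coded in the same way.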

2) Examples

Coding examples of the amplitude vector, the blank vector and the symbolic vector are given in attached Tables one, two and three respectively.

Five. Modeling of the handwritten script

The same as the modeling steps for the computer font.

Six. Character matching

The steps comprise extraction and matching of the feature vectors of the computer font.

The extraction of the feature vectors of the computer font is accomplished by the computer-font modeling step.

The matching of the feature vectors of the computer font comprises matching of single characters, and detection matching and error correction of sentences.

1. Matching of single characters

1) To associate each Chinese character with an index number in the feature-vector database, an index table of the feature database should be set up for the computer font. Reducing the similarity computations in the subsequent feature-vector matching, and thereby improving the recognition rate of the system, is a major feature of the design of the invention.

The steps are as follows:

From the codes of the feature vectors of the top, bottom, left and right projections, build the combined feature-vector database; the combined codes in the whole feature-vector database are arranged in Gray-code order;

Convert the character-set code into binary form;

Build a mapping table (see Table seven) from the feature-vector database to the character-set code; the character-set code uses the national standard GB coding.

2) Set up an index table between the feature-vector database and the character set, encode each Chinese character, and use the known computer character codes to index the Chinese characters.

Building the feature-vector database comprises:

For the six feature vectors previously established for each Chinese character, taking the impulse functions on the X axis as an example, set up one list to store the ratios of the intervals of the impulse functions, one list to store the ratios of the amplitudes of the impulse functions, and one list to store the sign sequence values of the impulse functions;

Likewise set up three lists based on the Y axis;

Then encode;

The index order of the lists is arranged as follows:

X → Y,

Symbolic vector → blank vector → ratio of amplitudes,

The symbolic vector has only two possibilities, positive and negative, represented by 0 and 1, and is arranged in Gray-code order,

The blank vector is a ratio; the ratio is converted to integers and coded from small to large, starting from the first.
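Gray-code order means that consecutive codes differ in exactly one bit, so neighbouring entries in the database differ in a single feature bit. A minimal sketch of generating that order is given below; the function name `gray_codes` and the 3-bit width are chosen only for illustration.

```python
def gray_codes(bits):
    """Return all codes of the given width in reflected Gray-code order."""
    return [format(n ^ (n >> 1), f"0{bits}b") for n in range(1 << bits)]

codes = gray_codes(3)
print(codes)   # ['000', '001', '011', '010', '110', '111', '101', '100']
```

The last four bits of the upper-projection column in Table four (0000, 0001, 0011, 0010) follow the same order.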

3) Example of building the feature-vector database with 5 feature vectors:

See attached Tables four, five and six.
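The lookup from index values to a GB code can be sketched with a dictionary. The single entry is the row shown in Table seven; the names `MAPPING_TABLE`, `index_code` and `lookup_gb`, and the use of plain string concatenation to form the index coding, are assumptions that match that row but are not spelled out in the text.

```python
# Mapping table from index coding to GB code; the one entry is the row of Table seven.
MAPPING_TABLE = {"000010000100001": "011010"}

def index_code(index1, index2, index3):
    """Concatenate the three index values into one index coding."""
    return index1 + index2 + index3

def lookup_gb(index1, index2, index3, table=MAPPING_TABLE):
    """Return the GB code for the three index values, or None if there is none."""
    return table.get(index_code(index1, index2, index3))

gb = lookup_gb("00001", "00001", "00001")
print(gb)   # 011010
```

A result of None corresponds to the case where a handwritten character has no matching computer character, which the text leaves to the language model to resolve.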

2. Detection matching of sentences

The detection matching of sentences is carried out as follows: a corpus built from phrases is checked using the trigram statistical language model method.

The corpus collects commonly used sentences and phrases on the basis of extensive practice. From it the prior and posterior probabilities of each word are computed, and the word currently being recognized is then predicted from the words that have already appeared.

Let wi be any word in the text. If the two words preceding it in the text, wi-2 and wi-1, are known, the probability of wi occurring can be predicted with the conditional probability P(wi | wi-2 wi-1). This is the concept of a statistical language model. More generally, if the variable W denotes an arbitrary word sequence in the text, consisting of n words in order, i.e. W = w1 w2 ... wn, then the statistical language model is the probability P(W) that this word sequence W occurs in text. Using the product rule of probability, P(W) can be expanded as:

$$ P(W) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 w_2 \ldots w_{n-1}) $$

This is too complex to compute directly. If the probability of any word wi depends only on the two words before it, the problem is greatly simplified. The language model is then called a trigram model (tri-gram):

$$ P(W) \approx P(w_1)\, P(w_2 \mid w_1) \prod_{i=3}^{n} P(w_i \mid w_{i-2} w_{i-1}) $$

In general, an N-gram model assumes that the probability of the current word depends only on the N-1 words before it. Importantly, these probability parameters can all be estimated from a large-scale corpus. For example, the trigram probability is:

$$ P(w_i \mid w_{i-2} w_{i-1}) \approx \frac{\mathrm{count}(w_{i-2} w_{i-1} w_i)}{\mathrm{count}(w_{i-2} w_{i-1})} $$

In the formula, count(...) denotes the cumulative number of times a particular word sequence occurs in the whole corpus.
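The count-based trigram estimate can be illustrated on a toy corpus. The function names `train_counts` and `trigram_prob` and the nine-token corpus are hypothetical; a real system would use a large corpus and some form of smoothing, which the text does not discuss.

```python
from collections import Counter

def train_counts(tokens):
    """Count all trigrams and bigrams in a token sequence."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigrams = Counter(zip(tokens, tokens[1:]))
    return trigrams, bigrams

def trigram_prob(trigrams, bigrams, w2, w1, w):
    """Estimate P(w | w2 w1) as count(w2 w1 w) / count(w2 w1)."""
    denom = bigrams[(w2, w1)]
    return trigrams[(w2, w1, w)] / denom if denom else 0.0

tokens = list("abcabdabc")           # toy corpus, one token per letter
tri, bi = train_counts(tokens)
p_c = trigram_prob(tri, bi, "a", "b", "c")
p_d = trigram_prob(tri, bi, "a", "b", "d")
print(p_c, p_d)
```

In the toy corpus "ab" occurs three times, followed twice by "c" and once by "d", so the estimates are 2/3 and 1/3.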

3. Matching error correction of sentences:

The probability model is combined with code recognition to recognize handwritten script accurately. The specific steps are as follows:

After the GB code of the corresponding computer character has been obtained for a handwritten character from its code, the corpus is consulted to obtain the degree of correlation between this character and the characters that precede it; if the degree of correlation is too small, the process returns to the previous feature database;

The symbolic vector is shifted within a bound of no more than 5 code elements up or down, and the blank vector and the amplitude vector are shifted together within a bound of no more than 20 code elements up or down; the corpus is consulted once for every 10 shifts of each vector;

This continues until a match whose probability exceeds 80% is found, at which point the correspondence between the handwritten character and that character can be determined, giving a higher recognition rate. Because the system embeds an existing corpus directly, no learning process is needed.

When the handwriting is very irregular, the error-correction stage is indispensable.
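The acceptance test at the end of the error-correction loop can be sketched as follows. This is only an outline under assumptions: the candidate list stands in for the characters obtained by shifting the feature codes within the stated bounds (the shifting itself is not modeled), and `pick_candidate`, `toy_prob` and the probability values are invented for the example.

```python
def pick_candidate(candidates, prev2, prev1, prob, threshold=0.8):
    """Return the first candidate whose probability given the two preceding
    characters exceeds the threshold, or None if no candidate qualifies."""
    for ch in candidates:
        if prob(prev2, prev1, ch) > threshold:
            return ch
    return None

# Made-up probabilities standing in for statistics drawn from a corpus.
PROBS = {("a", "b", "c"): 0.9, ("a", "b", "d"): 0.1}

def toy_prob(w2, w1, w):
    return PROBS.get((w2, w1, w), 0.0)

result = pick_candidate(["d", "c"], "a", "b", toy_prob)
print(result)   # c
```

The first candidate "d" is rejected because its context probability is below 80%, and the search moves on to "c".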

Seven. In summary, through a series of modeling and coding steps and, finally, the use of a corpus, the Chinese character recognition system thus built applies a variety of means such as segmentation, classification and coding to achieve computer recognition and automatic input of handwritten characters.

Eight. Attached tables

Table one amplitude vector

Table two blank vector

Table three symbolic vector

+	--	+	.........	+
1	0	1	.........	1

Table four amplitude vector 1

Upper projection	Lower projection	Left projection	Right projection	Index value 1
1 0 0 0 0	1 0 0 0 0	1 0 0 0 0	1 0 0 0 0
1 0 0 0 1	1 0 0 1 1	1 0 0 0 1	1 0 0 0 1
1 0 0 1 1	1 0 0 1 0	1 0 0 1 0	1 0 0 1 0
1 0 0 1 0	1 0 1 1 0	1 0 0 1 1	1 0 0 1 1
...	...	...	...	...
1 1 1 1 1	1 1 1 1 1	1 1 1 1 1	1 1 1 1 1

Table five blank vector 1

Upper projection	Lower projection	Left projection	Right projection	Index value 2
1 0 0 0 0	1 0 0 0 0	1 0 0 0 0	1 0 0 0 0
1 0 0 0 1	1 0 0 1 1	1 0 0 0 1	1 0 0 0 1
1 0 0 1 1	1 0 0 1 0	1 0 0 1 0	1 0 0 1 0
1 0 0 1 0	1 0 1 1 0	1 0 0 1 1	1 0 0 1 1
...	...	...	...	...
1 1 1 1 1	1 1 1 1 1	1 1 1 1 1	1 1 1 1 1

Table six symbolic vector 1

Upper projection	Lower projection	Left projection	Right projection	Index value 3
1 0 0 0 0	1 0 0 0 0	1 0 0 0 0	1 0 0 0 0
1 0 0 0 1	1 0 0 1 1	1 0 0 0 1	1 0 0 0 1
1 0 0 1 1	1 0 0 1 0	1 0 0 1 0	1 0 0 1 0
1 0 0 1 0	1 0 1 1 0	1 0 0 1 1	1 0 0 1 1
...	...	...	...	...
1 1 1 1 1	1 1 1 1 1	1 1 1 1 1	1 1 1 1 1

Table seven mapping table

Index 1	Index 2	Index 3	Index coding	GB
00001	00001	00001	000010000100001	011010

Claims

1. A method for computer recognition and automatic input of handwritten script, comprising:

1) A step of image preprocessing of the handwritten script input from a scanner; characterized by further comprising:

2) Extraction of the handwritten characters, comprising line segmentation and character segmentation, wherein line segmentation is performed using the horizontal projection of the text lines and character segmentation is performed using the vertical projection of the characters;

5) Character matching, comprising extraction and matching of the feature vectors of the computer font, wherein the extraction of the feature vectors of the computer font is accomplished by the computer-font modeling step, and the matching of the feature vectors of the computer font comprises matching of single characters, and detection matching and error correction of sentences;

6) Recognition of the handwritten script, with the following steps:

Steps 1) to 5) above being the steps of the automatic input method.

2. The method for computer recognition and automatic input of handwritten script according to claim 1, characterized in that the line segmentation method is: a group of horizontal parallel light rays is used for illumination, giving a projection along one coordinate direction, the gray level of this projection being measured by the amount of "luminous flux" that is blocked, with the formula

$$ v_{y} = \sum_{x=0}^{S_x} f_{1}(x, y)\, f(x, y) \tag{1} $$

3. The method for computer recognition and automatic input of handwritten script according to claim 1, characterized in that the character segmentation method is: a group of vertical parallel light rays is used for illumination, giving a projection along one coordinate direction, the gray level of this projection being measured by the amount of "luminous flux" that is blocked, with the formula

$$ v(x) = \sum_{y=1}^{S_y} f_{1}(x, y)\, f(x, y) \tag{2} $$

4. The method for computer recognition and automatic input of handwritten script according to claim 1, characterized in that the method of extracting the font feature vectors is:

1) Establish the feature vector of the character pattern: first, a standard dot matrix is set up for the image of each single handwritten character obtained after segmentation, that is, one whose horizontal and vertical extents are equal, and a 0/1 dot matrix is built; the projection of the handwritten character is compared with the standard dot matrix and binarized, this being done in the image preprocessing step; then the image of the single handwritten character is projected from the top, bottom, left and right, giving four groups of feature vectors,

2) Establish the edge functions of the character pattern: H1(X), H2(X), H1(Y), H2(Y),

3) Quantize the edge functions, with the formula

$$ h(x) = \sum_{x_1=0}^{b_1} \frac{H(x_1) + H\left(x_1 + \frac{b_1}{m}\right)}{2} \left[ u(x - x_1) - u\left(x - x_1 - \frac{b_1}{m}\right) \right] \tag{3} $$

4) Extract the feature vectors of the character pattern: the quantized edge functions built from the four edge functions H1(X), H2(X), H1(Y), H2(Y) are each differentiated, giving four groups of vectors made up of impulse functions,

For each group of impulse functions, three groups of feature vectors are extracted as follows:

Each impulse function represents a direction; a positive direction is recorded as 1 and a negative direction as 0, and arranged in order these form a feature vector group S1,

Between every two impulse functions there is an interval; the ratio of all the intervals is recorded,

The ratio of the amplitudes of all the impulse functions is recorded,

Proceeding in the same way gives the vectors for the different directions, that is, top, bottom, left and right.

5. The method for computer recognition and automatic input of handwritten script according to claim 1, characterized in that the method of classifying the font feature vectors is to build the database on the basis of coding, as follows,

Amplitude vector: reflects the undulation of the character, and is coded as follows,

Given an amplitude vector b(1), b(2), b(3), ..., b(n), where n is a natural number,

The code of b(1) is 1; if b(2) > b(1), the code of b(2) is 1, otherwise it is 0; each subsequent element b(i) is coded as 1 if b(i) > b(i-1) and as 0 otherwise;

Blank vector: reflects the distribution of the strokes of the character, and is coded in the same way as the amplitude vector; symbolic vector: has already been coded in the preceding step, and is likewise a vector made up of 1s and 0s.

6. The method for computer recognition and automatic input of handwritten script according to claim 1, characterized in that the matching of single characters has the following steps:

From the codes of the feature vectors of the top, bottom, left and right projections, build the combined feature-vector database; the combined codes in the whole feature-vector database are arranged in Gray-code order;

Convert the character-set code into binary form;

Build a mapping table from the feature-vector database to the character-set code, the character-set code using the national standard GB coding;

Set up an index table between the feature-vector database and the character set, encode each Chinese character, and use the known computer character codes to index the Chinese characters;

Building the feature-vector database comprises:

1) For the six feature vectors previously established for each Chinese character, taking the impulse functions on the X axis as an example, set up one list to store the ratios of the intervals of the impulse functions, one list to store the ratios of the amplitudes of the impulse functions, and one list to store the sign sequence values of the impulse functions;

2) Likewise set up three lists based on the Y axis;

3) Then encode;

4) The index order of the lists is arranged as follows:

X → Y,

Symbolic vector → blank vector → ratio of amplitudes; the symbolic vector has only two possibilities, positive and negative, represented by 0 and 1, and is arranged in Gray-code order; the blank vector is a ratio, which is converted to integers and coded from small to large, starting from the first.

7. The method for computer recognition and automatic input of handwritten script according to claim 1, characterized in that the detection matching of sentences is carried out as follows: a corpus built from phrases is checked using the trigram statistical language model method.

8. The method for computer recognition and automatic input of handwritten script according to claim 1, characterized in that the matching error correction of sentences combines the probability model with code recognition to recognize handwritten script accurately, with the following specific steps:

This continues until a match whose probability exceeds 80% is found, at which point the correspondence between the handwritten character and that character can be determined.