CN101930300A

CN101930300A - Digitized Chinese information processing method and random coding method for Chinese characters

Info

Publication number: CN101930300A
Application number: CN2010102741414A
Authority: CN
Inventors: 陈玉龙
Original assignee: 刘陶
Priority date: 2010-09-07
Filing date: 2010-09-07
Publication date: 2010-12-29

Abstract

The invention discloses a digitized Chinese information processing method. The method comprises: splitting the level-1 and level-2 national standard (GB) Chinese characters according to standardized information; and listing the split information as feature codes, wherein each character feature code in a feature code list comprises the character's national standard code (GB code) and a control code. The GB code is the character code, and the control code marks the sequence code of the listed information (components, initials, finals or strokes) within the split of that character. With digitized character information, Chinese character coding becomes a simple agreement between person and machine: the user only needs to input characters and words according to a preset coding rule. Whether the input uses a single type of information or mixed types, and whatever the code length, the computer can carry out each kind of coding operation by sampling the relevant coding information and testing a coding condition. A whole series of coding combinations is thus realized without compiling code tables or performing any switching, and the various coding combinations of characters and words are generated automatically by the program. Random coding operation is thereby realized.

Description

Digitized Chinese information processing method and random coding method for Chinese characters

Technical field

The present invention relates to the field of computer Chinese information processing, and in particular to a digitized Chinese information processing method and a random coding method for Chinese characters that uses this processing method.

Background art

In the existing field of Chinese information processing, take shape-based coding as an example. To allow characters and words to be input on a QWERTY keyboard, code table designers generally decompose the level-1 and level-2 GB characters (GB2312) into more than 200 components and then place these components directly on the 26 letter keys, so that the component codes obtained by splitting the characters become operable key codes (the character code table). A separate word code table is also compiled (or the character and word code tables are generated with the help of a code generator), and both are installed in the Windows Chinese operating system to carry out character and word coding. Although the code table structure solves the input of characters and words, the decomposed character information has not been digitized, so a computer program cannot operate on it. As a result, every coding scheme must compile its own set of character and word code tables, and each can realize only a single class of coding operation (single information type, single sequence code, single code length). This wastes considerable manpower and money and is inconvenient for both code designers and operators.

Summary of the invention

In view of the above deficiencies of the prior art, the technical problem to be solved by the invention is to provide a digitized Chinese information processing method that brings Chinese information under program control and processing, so as to realize random coding input of characters and words.

To solve the above technical problem, the invention adopts the following technical scheme:

A digitized Chinese information processing method: the level-1 and level-2 GB characters are split according to standardized information, and the split character information is listed as feature codes. Each character feature code in a list comprises two parts, the character's national standard code (GB code) and a control code; the GB code is the character code, and the control code is the sequence code of the listed information within the split of that character. The complete set of raw feature code lists for components, initials and finals, and strokes (the YG table) serves as the information source for the design of random character coding.
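As an illustration only, a feature code entry can be pictured as a pair of integers. The following Python sketch is not part of the described method: the class name `FeatureCode`, the helper `gb_code`, the D0 to D4 bit names and the sample list are assumptions, and the two-byte GB2312 encoding of a character stands in for its GB code.

```python
from dataclasses import dataclass

# Assumed bit layout of a component control code, D0..D4:
# first, second, third, last and tail component marks.
FIRST, SECOND, THIRD, LAST, TAIL = (1 << bit for bit in range(5))


@dataclass(frozen=True)
class FeatureCode:
    gb_code: int       # character code (here: the two GB2312 bytes as an integer)
    control_code: int  # sequence code of the listed element within the split


def gb_code(char: str) -> int:
    """Return the two-byte GB2312 encoding of one character as an integer."""
    return int.from_bytes(char.encode("gb2312"), "big")


# Hypothetical raw (YG) list for one component: each character that contains
# the component, marked with the position at which the component occurs.
yg_list = [
    FeatureCode(gb_code("系"), FIRST),
    FeatureCode(gb_code("自"), FIRST),
]
```

Because every entry is plain numeric data, a program can filter or combine entries directly, which is the property the rest of the description relies on.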

Preferably, the level-1 and level-2 GB characters are decomposed according to one of three standards: the "Hanzi component standard", the "Chinese character stroke standard" or the "Scheme for the Chinese Phonetic Alphabet".

The invention also discloses a method for random coding of Chinese characters that uses the above digitized Chinese information processing method, with the following technical scheme:

A) first, split the level-1 and level-2 GB characters according to the "Hanzi component standard", the "Chinese character stroke standard" and the "Scheme for the Chinese Phonetic Alphabet"; the split information forms feature code lists comprising the GB code and the control code;

B) build the raw character feature code lists;

C) design key positions for the digitized raw information;

D) convert the raw lists into digitized key-position lists;

E) set up the character and word coding buffers;

F) store the keyed-in information in the character and word coding buffers;

G) after the character (or word) end key, sample the coding information in the character (or word) buffer according to the coding features of the character (or word);

H) test, against the coding feature data of the character (or word), whether it satisfies the coding condition, and output the coded characters (or words); random coding processing then ends.

Preferably, among the pinyin information obtained by splitting the level-1 and level-2 GB characters according to the "Scheme for the Chinese Phonetic Alphabet", the control code of the lists of initials (21) is "01" and the control code of the lists of finals (35) is "02".

Preferably, the strokes obtained by splitting the level-1 and level-2 GB characters according to the "Chinese character stroke standard" are generally the five strokes "horizontal, vertical, left-falling, dot (right-falling), hook", coded with the number keys 1 to 5. They can also be combined in pairs (5 × 5 = 25 groups) and placed virtually on the letter keys, so that the user operates the number keys while the lists are organized by letter key.

Preferably, the marks carried by the control code in the component lists comprise a first-component mark, a second-component mark, a third-component mark, a last-component mark and a tail-component mark. The last component refers specifically to the final component of a character with four or more components; the tail component refers generally to the final component of any character, whether it consists of a single character-forming component, two components, three components or more.

Further, the conversion of the raw lists into digitized key-position lists uses a key order of code length four (which is also compatible with coding at code lengths three and two). Within the four-code key order, the component that appears first is treated as the first component, followed in turn by the second, third and last components. Likewise, the first pinyin information that appears among the four keys is treated as the initial and the next as the final; stroke information appearing among the four keys is taken in turn as strokes one and two, strokes three and four, and the fifth (final) stroke.

Preferably, in the random coding embodiment, the type of input information can be changed at will for both character and word coding, without any switching.

Preferably, in the random coding embodiment, for character coding the code length can be changed at will without any switching, whether the input is same-class or mixed-class information.

Preferably, the random coding embodiment includes setting up character and word coding buffers whose capacity covers the entire character set (GB2312). Their function is as follows: input information is stored in the character and word coding buffers respectively; the coding information of characters (words) is sampled from the buffers and tested against the character (word) coding condition, finally yielding the coded characters (or words).

Preferably, the data bits of every character in the character and word coding buffers are cleared to zero before character or word information is input.

The above technical scheme has the following beneficial effects. The method lists the more than 200 split components, the initials and finals, and the stroke information as feature codes; each character feature code in a list comprises the GB code and the control code, where the GB code is the character code and the control code is the sequence code of the listed information within the split of that character. Each piece of information obtained by splitting a character is thus not only preserved in that character's control code but also digitized. Because a computer program can process every piece of split information, all coding combinations of characters and words can be generated automatically by the program. An example of the feature code list structure is given in attached Table 1.

Once character information has been digitized, computer coding of Chinese characters becomes a simple "agreement" between person and machine. Code tables no longer need to be compiled; only a coding rule that both person and machine can recognize has to be set in the system. The user inputs characters and words according to the preset rule, and the system automatically samples the coding information and tests the coding condition to complete a series of different coding operations (that is, random coding). Setting a coding rule is simpler, more convenient and faster than compiling a code table by hand or generating one with a code generator (which does not escape the drawbacks of code tables), and the resulting coding function is much more powerful. The purpose of the coding rule is simply to let the computer tell, from the key positions, sequence codes and code length of the input, which coding in its series of coding combinations the input belongs to.

The most significant property of digitized Chinese information is that it can be handled by computer programs, which is also the core technique for realizing random coding. The realization of random character coding is used below as an example to describe in detail the programming method that digitization enables and its information processing capability.

Description of drawings

Fig. 1 is the flow chart of an embodiment of the invention.

Embodiment

(1) Technical features of random coding of Chinese characters:

1. Chinese character information includes components, pinyin, strokes and so on. However, all code table schemes use an input mode of single information type, single sequence code and single code length, and one set of character and word code tables supports only one kind of input operation.

Random coding allows the class of information to be changed at will while a character or word is being input: the user can operate with component codes, pinyin codes or stroke codes alone, or mix these three classes of information. Provided the following rules are observed, the computer can tell from the series of coding combinations which class of coding operation the user has entered, and complete the coding automatically by program:

Character coding rule: whether same-class or mixed-class information is input, the keyed-in coding information is taken in the decomposition order of each class; the first key order of a class is always taken first, then its second key order. Examples: first component, initial, second component, final (code length four); or strokes one and two, initial, strokes three and four (code length three).

Word coding rule: for a two-character word, the first key may be any information. If the second key is of the same class as the first key, the computer takes it as the second key order of that class, such as the second component, the final, or strokes three and four; if it is of a different class from the first key (say, a component), the computer takes it as the first key order of the other class, such as the initial or strokes one and two. The computer processes the information for the tail character in the same way as for the lead character.

For a three-character word, the first key order of the first and second characters is taken, and the keys for the tail character follow the rule for two-character words.

For words of four or more characters, the first key order (of any class of information) of the first, second and third characters and of the tail character is taken.

The code-taking rules of random coding are essentially the same as those of traditional coding; they match people's habitual thinking and need no special memorization.
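The character coding rule above can be pictured with a short Python sketch. It is only one possible reading of the rule: the function name `key_orders` and the class labels "component", "pinyin" and "stroke" are assumptions made for illustration.

```python
from collections import Counter


def key_orders(classes):
    """Assign a key order to each key of one character.

    classes: information class of each key in typing order, for example
    "component", "pinyin" or "stroke" (labels assumed for illustration).
    Returns 0 for the first key order of a class, 1 for the second, and so on.
    """
    seen = Counter()
    orders = []
    for info_class in classes:
        orders.append(seen[info_class])
        seen[info_class] += 1
    return orders


# First component, initial, second component, final (code length four).
print(key_orders(["component", "pinyin", "component", "pinyin"]))  # [0, 0, 1, 1]
# Strokes one and two, initial, strokes three and four (code length three).
print(key_orders(["stroke", "pinyin", "stroke"]))  # [0, 0, 1]
```

Each class keeps its own counter, so switching class in mid-character never disturbs the decomposition order of the other classes.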

2. Random coding allows the code length to be changed at will. If four keys are entered followed by the SP (space) key, the system processes the input at code length four (for component codes: the first, second, third and last components). If three keys are entered followed by SP, the system processes it at code length three (for component codes: the first and second components plus the last component, which the system adjusts automatically). Two keys followed by SP are processed at code length two. All of these are random coding inputs of different code lengths, and they are entirely different from level-2 and level-3 input modes.

3. The character and word coding series covered by random coding are processed in real time, without any switching. A user who is familiar with component information keys components with the letter keys and initials and finals with SHIFT + letter key; for a user who is familiar with pinyin information the key assignment is reversed. The five strokes (horizontal, vertical, left-falling, dot, hook) are coded with the number keys 1 to 5.
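A minimal sketch of how a keystroke might be assigned to an information class under this key assignment follows. The function name `info_class`, the `shift` flag and the `prefer` parameter are assumptions; the description does not specify how the user preference is stored.

```python
def info_class(key: str, shift: bool = False, prefer: str = "component") -> str:
    """Classify one keystroke as "component", "pinyin" or "stroke".

    prefer names the class typed with plain letter keys; the other class is
    typed with SHIFT + letter key.
    """
    if len(key) != 1:
        raise ValueError("expected a single key")
    if key in "12345":
        return "stroke"
    if key.isascii() and key.isalpha():
        other = "pinyin" if prefer == "component" else "component"
        return other if shift else prefer
    raise ValueError(f"key {key!r} carries no coding information")
```

Because the class is decided per keystroke, no mode switch is needed between component, pinyin and stroke input.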

4. Random coding is not only powerful, simple to implement and easy to operate, but also requires no code table at all: the character and word coding series in the system are generated automatically by the system program.

5. In a code table system it is very difficult to revise the coding design; changing even a single key position requires revising large parts of the character table and the word table. With random coding, implementing or revising a coding design only requires running the system's key-position design program, after which the new random coding operation is available at once.

Readers may find the operating functions of random coding hard to believe, or may assume that realizing them must be complicated and difficult; this specification should dispel such doubts. The feasibility of all the coding combinations above has been studied here, and implementing random coding is not complicated, because the closed code table information structure is replaced by a digitized feature code list structure. Digitization of Chinese information is the technical foundation of random coding.

(2) Raw Chinese character feature code lists (YG table)

Taking component information as an example, the character feature code component list is the set of the more than 200 individual component lists.

An example of the component list is given in attached Table 2. The actual component lists use a code length of four, because a four-code component list can also serve component coding at code lengths three and two. The tail key position in the list structure marks the component count of the character and is used to adjust the position of the last key when the code length is changed. The component information in each character is open: "open" means that any component of a character can be processed in real time, anywhere and at any moment, no longer restricted to one class of coding, so that a series of mixed-class coding combinations can be handled at once. This is the technical concept of "random coding" of Chinese characters.

Feature code lists for the Chinese phonetic alphabet (initials and finals) and for stroke information can be built in the same way, giving a complete raw character feature code table (YG table). The list structure for pinyin and stroke information is the same as for components. For the pinyin lists, the 21 initials and 35 finals are listed separately: each character is first split into an initial (including the zero initial) and a final; the GB codes of characters sharing an initial are listed under that initial with control code "01", and the GB codes of characters sharing a final are listed under that final with control code "02". Stroke feature code lists can be built per single stroke, listing under each of the 5 strokes the characters that contain it and marking the sequence code of the stroke in each character; or per stroke pair, where the 5 strokes form 25 two-stroke groups, listing under each group the characters that contain it and marking the sequence code of that stroke pair in each character.
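The construction of the pinyin lists can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the names `split_syllable` and `build_pinyin_lists` are hypothetical, readings are toneless with one reading per character, and characters stand in for their GB codes.

```python
from collections import defaultdict

# The 21 initials; two-letter initials come first so that "zh" wins over "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")
INITIAL_CODE, FINAL_CODE = 0x01, 0x02  # control codes "01" and "02"


def split_syllable(syllable):
    """Split a toneless syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # zero initial


def build_pinyin_lists(readings):
    """Build one list per initial and one per final from {character: syllable}."""
    initials, finals = defaultdict(list), defaultdict(list)
    for char, syllable in readings.items():
        initial, final = split_syllable(syllable)
        initials[initial].append((char, INITIAL_CODE))
        finals[final].append((char, FINAL_CODE))
    return initials, finals


initials, finals = build_pinyin_lists({"中": "zhong", "安": "an", "系": "xi"})
```

With the control codes read as bits, "01" sets D0 (first key order) and "02" sets D1 (second key order), which matches the bit positions used for components.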

China has established national standards for the Chinese phonetic alphabet and for character strokes. A national standard for components, the "Hanzi component standard" with 560 components, was also issued, but it has not been accepted and adopted by the public. Once a component standard gains general acceptance, a raw feature code table (YG table) containing comprehensive character information would become the standardized information for computer coding of Chinese characters and the information source for all kinds of coding designs, allowing users to carry out coding design and input operations directly in the system. It would also help Chinese character coding move toward unification.

(3) Chinese character feature code key-position lists (JG table)

To make keyboard input effective, a coding design must be carried out on the basis of the raw feature code table (YG table): the component, pinyin and stroke information is assigned to keys, converting the YG table into a key-position feature code table (JG table). The conversion is simple: the raw lists assigned to the same key are gathered into one list for that key, and the control codes of the same character are combined (by an OR operation) into a single feature code for that character.
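The YG-to-JG conversion described above amounts to a grouped bitwise OR. A minimal sketch, assuming dictionary-based lists and a hypothetical function name `build_jg`; the placeholder characters and element names carry no real decomposition data:

```python
def build_jg(yg_lists, key_of):
    """Merge raw (YG) lists into key-position (JG) lists.

    yg_lists: element -> {character: control code}
    key_of:   element -> key the element is assigned to
    """
    jg = {}
    for element, entries in yg_lists.items():
        key_list = jg.setdefault(key_of[element], {})
        for char, control in entries.items():
            # Control codes of the same character on the same key are OR-ed.
            key_list[char] = key_list.get(char, 0) | control
    return jg


# Hypothetical data: two elements assigned to key C, one to key E.
yg_lists = {
    "part-1": {"甲": 0b00001, "乙": 0b00001},
    "part-2": {"甲": 0b10010},
    "part-3": {"乙": 0b10010},
}
key_of = {"part-1": "C", "part-2": "C", "part-3": "E"}
jg = build_jg(yg_lists, key_of)
```

Changing a key assignment only means editing `key_of` and rerunning the conversion, which mirrors the claim that a revised coding design needs no hand-edited tables.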

An example of a key-position component list is given in attached Table 3. All examples here serve only the explanation that follows. Converting the initial and final lists of pinyin coding into key-position lists is fairly simple: initials are generally placed on the keys of their consonant letters (CH, SH and ZH are usually placed on the three keys U, V and I), and the main differences lie in the placement of the finals, that is, in how the 35 finals are merged onto the letter keys.

An example of the key-position list for initials and finals is given in attached Table 4.

Placing the strokes is simpler still: the 25 two-stroke groups are placed virtually on 25 letter keys, while actual input is better done one stroke at a time. Two strokes are combined into one code so that stroke input remains compatible with component codes and pinyin codes.

An example of the two-stroke feature code list is given in attached Table 5.
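The virtual placement of the 25 stroke pairs can be sketched as below. The assignment of pairs to the letters A to Y in order, the name `fold_strokes` and the handling of an unpaired final stroke are assumptions; the actual layout is the one given in the attached tables.

```python
from itertools import product
from string import ascii_uppercase

STROKES = "12345"  # horizontal, vertical, left-falling, dot, hook

# Assumed assignment: the 25 pairs take the letters A..Y in order.
PAIR_TO_KEY = {
    first + second: letter
    for (first, second), letter in zip(product(STROKES, repeat=2), ascii_uppercase)
}


def fold_strokes(digits):
    """Fold single-stroke keystrokes into virtual letter keys, two at a time."""
    if any(d not in STROKES for d in digits):
        raise ValueError("strokes are keyed with the digits 1 to 5")
    keys = [PAIR_TO_KEY[digits[i:i + 2]] for i in range(0, len(digits) - 1, 2)]
    if len(digits) % 2:
        keys.append(digits[-1])  # an unpaired final stroke stays a digit
    return keys
```

The user still presses number keys; the folding step is what lets a stroke pair occupy one code position alongside component and pinyin codes.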

After key-position design, the YG table is finally converted into a comprehensive key-position feature code table (JG table) that can be operated in real time. Like the YG table, the JG table is an open, digitized feature code table. Its structure differs from that of a character code table, but the information it contains is the same. A code table can support only a single class of coding operation, whereas the comprehensive JG table supports any class of character and word coding: single-class coding (component, pinyin or stroke codes), mixed-class coding (for example component/pinyin, component/stroke, pinyin/stroke or component/pinyin/stroke combinations) and coding at different code lengths. All of these are generated automatically on the basis of the JG table with no switching between them, since the computer can distinguish the different coding operations; this whole coding series is what is called "random coding" here. None of the character and word coding operations in random coding requires a code table. The input for every coding combination covered by random coding follows the preset key order rules; it is not a free-form search.

(4) Automatic generation of character and word codes

To process the keyed-in coding information quickly and effectively, a buffer for character and word coding is set up in the system. A feature of this buffer is that the GB code of a character converts directly into a buffer address, called the buffer GB address. Each character has one unit (one byte) in the character buffer and one in the word buffer; these record the input character and word coding information and are used to test character and word codes. They are called the character GB unit and the word GB unit respectively (and can be enlarged to two or more bytes as needed). Component coding is used below as an example to describe in detail how the program processes character and word codes.
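One way to picture the buffer GB address is shown below. The description does not give the conversion formula, so the row and cell arithmetic here is an assumption based on the GB2312 layout (level-1 and level-2 characters occupy rows 16 to 87, lead bytes 0xB0 to 0xF7 in the common two-byte encoding); the names `gb_address`, `char_units`, `word_units` and `clear_buffers` are hypothetical.

```python
ROWS, COLS = 72, 94           # GB2312 rows 16..87, 94 cells per row
BUFFER_SIZE = ROWS * COLS     # 6768 one-byte units


def gb_address(char):
    """Map one level-1 or level-2 GB2312 character to a buffer index."""
    high, low = char.encode("gb2312")
    if not (0xB0 <= high <= 0xF7 and 0xA1 <= low <= 0xFE):
        raise ValueError(f"{char!r} is not a level-1 or level-2 character")
    return (high - 0xB0) * COLS + (low - 0xA1)


char_units = bytearray(BUFFER_SIZE)  # character GB units
word_units = bytearray(BUFFER_SIZE)  # word GB units


def clear_buffers():
    """Zero every unit before a new character or word is keyed in."""
    char_units[:] = bytes(BUFFER_SIZE)
    word_units[:] = bytes(BUFFER_SIZE)
```

Direct addressing means that recording a keystroke for a character costs one index computation and one byte update, with no table lookup.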

1. Automatic generation of character codes:

Suppose we want to input the character 系 (the first character of 系统, "system"). The keys are "丿 (C), yarn component (E)". A two-component code length does not involve the last key position, so no key-position adjustment is needed. The program proceeds as follows:

First key C: for each character in the key-position list of key C, bit D0 of its control code (the first key order bit) is written to bit D0 of the character GB unit in the buffer, and bits D0D1 of the control code (the first and second key order bits) are written to bits D0D1 of the word GB unit in the buffer.

Second key E: for each character in the key-position list of key E, bit D1 of its control code (the second key order bit) is written to bit D1 of the character GB unit, and bits D0D1 of the control code are written to bits D2D3 of the word GB unit.

If after two keys the system detects character input (the SP key), it processes the input as a character code of length two: it scans every character GB unit in the character buffer and tests whether bits D0D1 are both "1"; if so, the character is a coded character.
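The two-key processing just described can be sketched as follows. This is a simplified illustration: a dictionary replaces the byte buffer, the function name `match_characters` is hypothetical, and the key-position lists contain made-up control codes rather than data from the attached tables.

```python
def match_characters(keys, jg):
    """Return the characters whose key order bits match every key typed."""
    units = {}  # character -> character GB unit (assumed dict, not a byte buffer)
    for order, key in enumerate(keys):
        bit = 1 << order  # D0 for the first key, D1 for the second, ...
        for char, control in jg.get(key, {}).items():
            if control & bit:
                units[char] = units.get(char, 0) | bit
    wanted = (1 << len(keys)) - 1  # D0D1 = 11 for two keys
    return [char for char, unit in units.items() if unit == wanted]


# Hypothetical key-position lists for keys C and E.
jg = {
    "C": {"系": 0b00001, "自": 0b00001},
    "E": {"系": 0b10010, "统": 0b00001},
}
print(match_characters("CE", jg))  # ['系']
```

The same loop covers code length four, where no last-key adjustment is needed; code length three requires the adjustment described next.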

For component input at code length three, the third key is the last key, so the last key position recorded in the four-code key-position list must be adjusted. Take the character 动 ("move"), keyed with the components 二, 厶 and 力. The system handles the keys of the first and second components 二 and 厶 as above. When it processes the key-position list of the third component 力, it examines the tail key position bit D4 of each control code and tests whether D4 and D2, or D4 and D3, are both "1". If so, the "1" in D4 is treated as last-key information and written to bit D2 of the character GB unit in the buffer; if not, the control code of the next character is examined. When this pass is finished, bits D0D1D2 of every character GB unit in the buffer are tested: if all are "1", the character is a coded character (动 satisfies this condition); if any of D0D1D2 is "0", it is not.
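The last-key adjustment for code length three can be sketched as follows, again with a dictionary in place of the byte buffer. The function name `match_three`, the bit names and the sample key-position lists are assumptions; in particular the key letter L for 力 is hypothetical.

```python
FIRST, SECOND, THIRD, LAST, TAIL = (1 << bit for bit in range(5))  # D0..D4


def match_three(keys, jg):
    """Match three keys as first component, second component, last component."""
    if len(keys) != 3:
        raise ValueError("expected exactly three keys")
    units = {}  # character -> character GB unit (assumed dict, not a byte buffer)
    for order, key in enumerate(keys):
        for char, control in jg.get(key, {}).items():
            if order < 2:
                hit = control & (1 << order)
            else:
                # Tail mark D4 together with D2 or D3: the component closes a
                # character of three components (D2) or of four or more (D3).
                hit = (control & TAIL) and (control & (THIRD | LAST))
            if hit:
                units[char] = units.get(char, 0) | (1 << order)
    return [char for char, unit in units.items() if unit == 0b111]


# Hypothetical key-position lists; the key letter L for 力 is an assumption.
jg = {
    "I": {"动": FIRST},
    "F": {"动": SECOND, "统": THIRD},
    "L": {"动": THIRD | TAIL},
    "E": {"统": FIRST},
    "D": {"统": SECOND},
    "G": {"统": LAST | TAIL},
}
```

In this sketch `match_three("IFL", jg)` finds 动 through the D4D2 case, and `match_three("EDG", jg)` finds the four-component 统 through the D4D3 case, while a third key that is not a tail component matches nothing.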

If the four-code component information "纟 (C), 亠 (D), 厶 (F), 儿 (G)" is keyed in, the character 统, which satisfies the coding key order (D0D1D2D3 all "1"), is detected.

An example of automatic generation of component character codes is given in attached Table 6.

2. Automatic generation of word codes:

Word coding involves only the first and second key orders of a character, not the last key position, so there is no last-key adjustment problem.

To input the two-character word 系统 ("system"), the four keys "丿 (C), yarn component (E), 纟 (E), 亠 (D)" are keyed in, and the program proceeds as follows:

The first and second keys are processed as described above.

Third key E: for each character in the list of key E, bit D2 of its control code (the third key order bit) is written to bit D2 of the character GB unit, and bits D0D1 of the control code are written to bits D4D5 of the word GB unit.

Fourth key D: for each character in the list of key D, bit D3 of its control code (the last key position) is written to bit D3 of the character GB unit, and bits D0D1 of the control code are written to bits D6D7 of the word GB unit.
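Taken together, the four key steps write bits D0D1 of each control code into successive bit pairs of the word GB unit. A minimal sketch, assuming a dictionary in place of the byte buffer, a hypothetical function name `fill_word_units` and made-up key-position lists:

```python
def fill_word_units(keys, jg):
    """Record up to four keys in the word GB units.

    Key i (0-based) writes bits D0D1 of each control code to bits
    D(2i) and D(2i+1) of that character's word GB unit.
    """
    units = {}  # character -> word GB unit (assumed dict, not a byte buffer)
    for i, key in enumerate(keys[:4]):
        for char, control in jg.get(key, {}).items():
            units[char] = units.get(char, 0) | ((control & 0b11) << (2 * i))
    return units


# Hypothetical key-position lists for the keys of 系统: C, E, E, D.
jg = {
    "C": {"系": 0b00001},
    "E": {"系": 0b10010, "统": 0b00001},
    "D": {"统": 0b00010},
}
units = fill_word_units("CEED", jg)
print({char: hex(unit) for char, unit in units.items()})  # {'系': '0x29', '统': '0x94'}
```

Every character on a key's list is updated at every position, which is why a unit can carry extra bits (here D5 of 系 and D2 of 统); the discrimination step later selects only the bits that belong to each character's place in the word.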

After the fourth key the system detects word input (the word end key) and begins word code discrimination, which is more complicated than character code discrimination. Automatic generation of word codes removes the need to compile a word code table, but a lexicon is still required. Entries in the lexicon are arranged in GB code order, so words whose lead character comes first in GB order stand at the head of the lexicon. To speed up discrimination, the entries are preferably grouped by word length (two-character, three-character, four-character and longer words), which avoids checking the length of every word; alternatively, a one-byte word length code can be placed before each entry, in which case the arrangement is not constrained by word length. Word codes are assumed here to use the conventional four-code scheme. Once the system detects word input, it begins scanning the word GB units of the buffer. Word code discrimination proceeds as follows:

① Find the first character (GB code) in the word coding buffer. If D0 = 0 in its word GB unit, the following steps are skipped for it: D0 = 0 means that the first key is not the first key order code of this character, and the word coding rule requires that, for two-character, three-character and longer words alike, the first key be the first key order code (for example the first component) of the lead character. The next character (GB code) is then taken. If D0 = 1 in its word GB unit, the first key belongs to the first key order code of this character; all lexicon entries with this character as lead character are found, and their word coding features are tested one by one.

② Take the first entry and, according to its word length, find the characters involved and their word GB units in the buffer. Discrimination depends on the word length of the entry:

Two-character words: the first four bits (D0 to D3) of the lead character's word GB unit and the last four bits (D4 to D7) of the tail character's word GB unit are combined into the word code discrimination unit. The coding feature data of a two-character word is "99", that is, bits D0, D3, D4 and D7 of the discrimination unit are "1"; they represent the first key order (D0) and second key order (D3) of the lead character and the first key order (D4) and second key order (D7) of the tail character. A two-character word satisfying this condition is a coded word; otherwise it is not.

Three-character words: bits D0D1 of the lead character's word GB unit, bits D2D3 of the second character's and bits D4D5D6D7 of the tail character's are combined in order into the discrimination unit. The coding feature data is "95", that is, bits D0, D2, D4 and D7 of the discrimination unit are "1"; they represent the first key order of the lead and second characters (D0, D2) and the first and second key orders of the tail character (D4, D7). If the code-taking rule for three-character words is changed to take the first key order of the lead and tail characters and the first and second key orders of the middle character, only the way the discrimination unit is assembled (bits D0D1 of the lead character's word GB unit, bits D2D3D4D5 of the second character's and bits D6D7 of the tail character's) and the coding feature data (which becomes "65") need to change; nothing else changes.

Words of four or more characters: bits D0D1, D2D3, D4D5 and D6D7 of the word GB units of the first, second, third and tail characters respectively are combined into the discrimination unit. The coding feature data is "55", that is, bits D0, D2, D4 and D6 of the discrimination unit are "1"; they represent the first key order of the first, second, third and tail characters of the word. A word satisfying this condition is a coded word; otherwise it is not.
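The three discrimination cases reduce to masking and comparing bytes. The sketch below is an illustration only: the function name `is_coded_word` is hypothetical, the word GB units are held in a dictionary, and the feature data is tested as a bit mask (all feature bits must be "1"), which is one reading of the condition stated above. The sample unit values are derived from the key sequences used in the description, not taken from the attached tables.

```python
def is_coded_word(word, units):
    """Test a lexicon entry against the word GB units of its characters."""
    u = [units.get(char, 0) for char in word]
    if len(word) == 2:
        unit = (u[0] & 0x0F) | (u[1] & 0xF0)
        feature = 0x99  # D0, D3, D4, D7
    elif len(word) == 3:
        unit = (u[0] & 0x03) | (u[1] & 0x0C) | (u[2] & 0xF0)
        feature = 0x95  # D0, D2, D4, D7
    elif len(word) >= 4:
        unit = (u[0] & 0x03) | (u[1] & 0x0C) | (u[2] & 0x30) | (u[-1] & 0xC0)
        feature = 0x55  # D0, D2, D4, D6
    else:
        return False
    return (unit & feature) == feature


# Word GB units as they would stand after keying C E E D, C I C B and E A C E.
two = {"系": 0x29, "统": 0x94}
three = {"自": 0x11, "动": 0x04, "化": 0x91}
four = {"编": 0x41, "码": 0x04, "系": 0x92, "统": 0x41}
print(is_coded_word("系统", two), is_coded_word("自动化", three),
      is_coded_word("编码系统", four))  # True True True
```

Switching to the alternative three-character rule would only change the masks to `0x03`, `0x3C`, `0xC0` and the feature data to `0x65`.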

③ The coding features of the second entry with the same lead character are then tested in the same way, and so on until the last entry. At this point only the entries whose lead character is the first character in the word buffer with D0 = 1 in its word GB unit have been tested.

④ Next, the second character (GB code) in the buffer whose word GB unit has D0 = 1 is taken, all lexicon entries with this character as lead character are found, and the same discrimination is applied. This continues until the last character in the buffer with D0 = 1 has been taken and all entries led by it have been tested. The coded words obtained are listed in the candidate display area.
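Steps ① to ④ can be sketched as a scan over lead characters. The lexicon layout (a dictionary from lead character to entries), the function name `find_words` and the pluggable `is_coded` test are assumptions; the stand-in test used in the example is deliberately simpler than the real discrimination by feature data.

```python
def find_words(units, lexicon, is_coded):
    """Collect the lexicon entries that satisfy the word coding condition.

    units:    character -> word GB unit
    lexicon:  lead character -> list of entries starting with it
    is_coded: callable (word, units) -> bool, the discrimination test
    """
    hits = []
    for lead, unit in units.items():
        if not unit & 0x01:
            # D0 = 0: the first key is not this character's first key order,
            # so no entry led by this character can match.
            continue
        for word in lexicon.get(lead, []):
            if is_coded(word, units):
                hits.append(word)
    return hits


# Hypothetical lexicon and a stand-in test that only checks that every
# character of the entry was touched by some key.
units = {"系": 0x29, "统": 0x94}
lexicon = {"系": ["系统", "系列"], "统": ["统一"]}
touched = lambda word, units: all(char in units for char in word)
print(find_words(units, lexicon, touched))  # ['系统']
```

Skipping characters with D0 = 0 prunes most of the lexicon before any entry is examined, which is the main saving available in this scan.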

As can be seen, the whole word-code discrimination process is long, especially when the dictionary is large, so the program should be designed to execute as fast as possible. This is also the reason for adding a word-length code to the dictionary.

An example of automatic generation of word codes from Chinese character components is given in Appendix Table 7. The example shows that the word code of the two-character word "系统" (system) is decided by whether bits D0 and D3 of the GB unit of "系" and bits D4 and D7 of the GB unit of "统" in the buffer are "1" (equivalent to the feature data "99" of its word-code discrimination unit). If they are, the first and second key orders of "系" and the first and second key orders of "统" are necessarily satisfied, and "系统" is an encoded word. If the coded information "丿 (C), 二 (I), 亻 (C), 匕 (B)" of the three-character word "自动化" (automation) is keyed in, its discrimination unit is composed of bits D0D1 of the GB unit of "自", bits D2D3 of "动" and bits D4D5D6D7 of "化". Its encoding feature data are "95", i.e. the first key order of the first and second characters and the first and second key orders of the last character are satisfied, so "自动化" is an encoded word. Likewise, keying in "纟 (E), 石 (A), 丿 (C), 纟 (E)" satisfies the coding key orders (feature data "55") of the four-character word "编码系统" (coding system).

Character codes and word codes are generated from pinyin and stroke information in the same way as from component information; examples are given in Appendix Table 8.

(5) Random coding

Random coding is in fact a series of predefined mixed-type information codes. "Predefined" means that every code combination in the series conforms to the character and word coding rules set out in the specification, which the computer can discriminate. It covers same-type and all different-type information codes, as well as character and word codes of different code lengths. The feasibility of random coding across different information types is illustrated below, taking three-code character codes and four-code word codes as examples.

1. Mixed-type character codes

Key in mixed information for "系": the first component "丿 (C)", the initial "x (SH+X)" and the first two strokes "丿 (3), 乙 (5)". When the first key is entered, the D0 bit of the control codes in the C key position table is filled into the D0 bit of the character GB units. When the second key is entered, the computer detects that it is pinyin information, x (SH+X). By the agreement above, its information type differs from that of the first key, so the computer takes it as an initial rather than a final, and fills the D0 bit of the control codes in that initial's key position table into the D1 bit of the character GB units. Likewise, when the third key is entered, the computer detects stroke information, whose type differs from both preceding keys, so for "丿 (3), 乙 (5)" it takes the D0 bit of the control codes in the two-stroke virtual key position table "X" (equivalent to the first two strokes "丿, 乙") and fills it into the D2 bit of the character GB units. After the character-code end key (SP), the character-code generation buffer is scanned; bits D0D1D2 of the character GB unit of "系" are found to be "1", so "系" is detected as the encoded character.

Similarly, key in three keys: the first and second components "丿 (C)" and "糸 (E)" and the initial "x (SH+X)". When the second key "糸 (E)" is entered, the computer detects that it and the first key are both component information, so it fills the D1 bit of the control codes in this key's position table into the D1 bit of the character GB units. When the third key is entered, the computer detects pinyin, which differs in type from the first two keys, so it takes the D0 bit of the control codes in the initial's key position table and fills it into the D2 bit of the character GB units. After the character end key, the computer finds that bits D0D1D2 of the character GB unit of "系" are "1" and again detects "系". If the first two strokes "乙 (5), 乙 (5)" (equivalent to the two-stroke virtual key "Q"), the initial t (SH+T) and the final ong (SH+B) are keyed in, the encoded character "统" is detected. Some characters with duplicate codes will of course also appear. Examples of mixed-type character codes are given in Appendix Table 9.
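The type-dependent code-taking just described can be sketched as follows. This is an illustrative reading, not the actual program: it assumes that the n-th key of a given information type reads bit D(n-1) of the control code (first component, first pinyin key or first stroke pair reads D0, the next key of the same type reads D1, and so on), and that the result is written to the bit of the character GB unit matching the key's position. The function names and the sample control codes are hypothetical.

```python
def source_bits(key_types):
    """For each key, return which control-code bit to read (0 = D0, 1 = D1, ...)."""
    seen = {}
    picks = []
    for kind in key_types:
        picks.append(seen.get(kind, 0))   # first key of a type reads D0
        seen[kind] = picks[-1] + 1        # next key of that type reads the next bit
    return picks

def character_unit(controls, key_types):
    """Build one character's GB unit from its control code on each pressed key.

    controls[i] is the character's control code in the table of the i-th key
    (0 if the character is not listed there).
    """
    unit = 0
    for position, (control, src) in enumerate(zip(controls, source_bits(key_types))):
        unit |= ((control >> src) & 1) << position
    return unit

# Component, pinyin, stroke: every key is the first of its type, so D0 is read each time.
mixed = character_unit([0b00001, 0b01, 0b1], ["component", "pinyin", "stroke"])

# Component, component, pinyin: the second component key reads D1.
two_parts = character_unit([0b00001, 0b10010, 0b01], ["component", "component", "pinyin"])
```

Both sample inputs yield 0b111, i.e. D0, D1 and D2 are all "1", which is the three-code detection condition for the character.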

2. Mixed-type word codes

Key in mixed information for the two-character word "系统": the initial "x (SH+X)" and final "i (SH+H)" of "系", and the first component "纟 (E)" and second component "亠 (D)" of "统". The computer fills bits D0D1 of each Chinese character's control code in the key position table of the initial "x (SH+X)" into bits D0D1 of the same character's GB unit in the word buffer, and fills bits D0D1 of each character's control code in the key position table of the final "i (SH+H)" into bits D2D3 of the same character's GB unit in the word buffer. Likewise, it fills bits D0D1 of the control codes in the key position tables of the first component "纟 (E)" and the second component "亠 (D)" of "统" into bits D4D5 and D6D7 respectively of the same characters' GB units in the word buffer. After the four keys, words are checked one by one from the dictionary following the word-code discrimination process described above: the word-code discrimination unit is built by the code-taking rule for each word length, and a word that matches its encoding feature data is an encoded word. For the two-character word above, the discrimination unit consists of bits D0D1D2D3 of the GB unit of "系" and bits D4D5D6D7 of the GB unit of "统", and its encoding feature data are "99" (because the first and second keys are of the same information type, as are the third and fourth keys). The two-character word "系统" is therefore an encoded entry.
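As a worked illustration of this filling step, the sketch below copies bits D0D1 of a character's control code for the k-th key into bits D(2k) and D(2k+1) of its GB unit in the word buffer. The control-code values used for "系" and "统" are assumptions chosen to match the description (D0 marks an initial or first component, D1 a final or second component); they are not taken from the actual tables, and `word_buffer_unit` is a hypothetical name.

```python
def word_buffer_unit(controls):
    """Build one character's 8-bit unit in the word buffer.

    controls[k] is the character's control code in the table of the k-th key
    pressed (0 if the character is not listed there). Only D0D1 are copied.
    """
    unit = 0
    for key_index, control in enumerate(controls[:4]):
        unit |= (control & 0b11) << (2 * key_index)
    return unit

# Keys pressed: initial x, final i, component on key E, component on key D.
# Assumed control codes: "系" has initial x (D0), final i (D1), and its second
# component on key E (D1); "统" has its first component on E (D0) and its
# second component on D (D1).
xi = word_buffer_unit([0b01, 0b10, 0b10, 0b00])
tong = word_buffer_unit([0b00, 0b00, 0b01, 0b10])

# Two-character rule: D0-D3 from the first character, D4-D7 from the last.
unit = (xi & 0x0F) | (tong & 0xF0)
```

With these assumed inputs the combined unit is 0x99, so the word satisfies the two-character feature data for same-type key pairs.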

An example of automatic generation of mixed-type word codes is given in Appendix Table 10. In the example, the keyed-in information for the two-character word "自动" (automatic) is the first component "丿 (C)" and initial "z (SH+Z)" of "自", and the first two strokes "一 (1), 一 (1)" and initial "d (SH+D)" of "动". The discrimination process is the same as for "系统" above, but here the first and second keys differ in information type, as do the third and fourth keys. Although both are two-character words, the feature data used to discriminate the word code differ; here they are "55". Likewise, for the two-character word "编码" (coding), entered as the initial "b (SH+B)" and final "ian (SH+C)" of "编" and the first two strokes "一 (1), 丿 (3)" and initial "m (SH+M)" of "码", the feature data are "59", because "编" is entered with same-type information (both pinyin) while "码" is entered with different-type information (stroke and pinyin). In the example, the keyed-in information for the three-character word "自动化" is the first component "丿 (C)" of "自", the initial "d (SH+D)" of "动", and the initial "h (SH+H)" and first two strokes "丿 (3), 丨 (2)" of "化"; its three-character feature data are "55". If the four keys are instead the initial "z (SH+Z)" of "自", the initial "d (SH+D)" of "动" and the first and second components "亻 (C), 匕 (G)" of "化", the feature data are "95". Because mixed-type coding of words of four or more characters involves only the first key of each character and never a second key order, the feature data for these words do not depend on the type of information keyed in. For "自动编码" (automatic coding), "编码系统" (coding system) or "自动编码系统" (automatic coding system, ACOM), whatever information is keyed in is first-key information, and the feature data are always "55".
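The dependence of the feature data on the key types can be summarized in a short sketch. It assumes, as the examples suggest, that a character's first key always contributes D0 of its key pair, while its second key contributes D1 when it continues the same information type and D0 when the type changes. The function name `feature_data` and the type labels are hypothetical.

```python
def feature_data(word_length, key_types):
    """Return the expected feature data for a four-key word code.

    key_types: the information type of each of the four keys, e.g.
    ["pinyin", "pinyin", "component", "component"].
    """
    if word_length >= 4:
        return 0x55          # only first keys are used; types do not matter

    def pair(first_pos, same_type):
        # The first key sets bit first_pos. The second key sets its D1 slot
        # (first_pos + 3) for the same type, or its D0 slot (first_pos + 2)
        # when the type changes.
        second_pos = first_pos + 3 if same_type else first_pos + 2
        return (1 << first_pos) | (1 << second_pos)

    tail = pair(4, key_types[2] == key_types[3])
    if word_length == 3:
        return 0b0101 | tail  # first keys of the first and second characters
    if word_length == 2:
        return pair(0, key_types[0] == key_types[1]) | tail
    raise ValueError("word_length must be at least 2")
```

For example, the key types component, pinyin, stroke, pinyin for a two-character word give 0x55, and pinyin, pinyin, stroke, pinyin give 0x59, matching the values quoted in the text.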

The Chinese information digitization processing method and the Chinese character random coding method provided by the embodiments of the invention have been described in detail above. For a person of ordinary skill in the art, both the specific embodiments and the scope of application may be varied in accordance with the idea of the embodiments. In summary, this description should not be construed as limiting the invention, and any change made in accordance with the design idea of the invention falls within the scope of protection of the invention.

Appendix Table 1

Example structure of the Chinese character feature code component list:

Notes: 1. The code length for Chinese characters in the component list above is designed as four codes. A four-code-length component list can also be used for three-code-length and two-code-length coding.

2. The end component refers specifically to the last component of a Chinese character with four or more components.

3. The tail component refers generally to the last component of any Chinese character, including single-component characters (character-forming components) and characters with two, three or more components.

4. In processing character, word and sentence codes, the GB codes in the actual list are converted to internal codes.

5. In the actual list, a high-frequency character-forming component often has more than a hundred Chinese character GB codes (GB2312).

6. This is only an example of the component list structure. The actual number of components varies with the code designer and is generally between 200 and 250.

7. A real system should also include lists of pinyin information and stroke information, which together form the complete Chinese character feature code raw information list (YG table).

8. In the pinyin information, the control code data of the initial list (21 initials) are "01", and the control code data of the final list (35 finals) are "02".

9. Chinese character strokes generally use the five strokes "horizontal, vertical, left-falling, dot (right-falling), hook", coded with the number keys 1 to 5. Two-stroke combinations (5 × 5 = 25 groups) can also be arranged virtually on the character keys: input is operated with the number keys, and the lists are kept by character key.
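The 5 × 5 = 25 two-stroke combinations mentioned in note 9 can be enumerated as in the sketch below. It is illustrative only: which character key each pair is assigned to is a design decision that this sketch does not attempt to reproduce, and the names `STROKES`, `PAIRS` and `pair_code` are hypothetical.

```python
from itertools import product

# The five stroke classes and their number keys.
STROKES = {1: "horizontal", 2: "vertical", 3: "left-falling", 4: "dot", 5: "hook"}

# All ordered two-stroke combinations: (1, 1), (1, 2), ..., (5, 5).
PAIRS = list(product(STROKES, repeat=2))

def pair_code(first, second):
    """Return the two number keys pressed for a stroke pair, e.g. '35'."""
    if first not in STROKES or second not in STROKES:
        raise ValueError("strokes are numbered 1 to 5")
    return f"{first}{second}"
```

Each of the 25 pairs would then be given one virtual character key, so that two number-key presses produce a single code.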

Appendix Table 2

Example of the Chinese character feature code component information list

Notes: 1. Blank cells in the component list above are "0".

2. The feature code raw information list here is only an example of the component list; the complete component list has more than 200 components.

3. The D0, D1 and D2 list information (components) belongs to the first, second and third components of the Chinese character respectively.

The D3 list information belongs to the end component of the Chinese character and marks the last component of a character with four or more components.

D4 marks tail component information, i.e. the last component of any character, including single-component characters and characters with two, three, four or more components.

4. The four-code-length component list is compatible with the three-code-length component list. The condition for the end component mark (D2) in the three-code-length list to equal "1" is: D4D2 = 1 or D4D3 = 1 (i.e. D4 and D2 are both "1", or D4 and D3 are both "1").
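The compatibility condition in note 4 amounts to a small Boolean expression. A minimal sketch, assuming the control code is stored as an integer with D0 as the lowest bit (the helper names are hypothetical):

```python
def bit(control, position):
    """Return bit D<position> of a control code (D0 is the lowest bit)."""
    return (control >> position) & 1

def end_mark_three_code(control):
    """End component mark (D2) for three-code-length use.

    It is 1 when D4 and D2 are both 1 (third component of a three-component
    character) or D4 and D3 are both 1 (last component of a character with
    four or more components).
    """
    d2, d3, d4 = bit(control, 2), bit(control, 3), bit(control, 4)
    return (d4 & d2) | (d4 & d3)
```

A third component that is not also the tail component (D2 = 1, D4 = 0) therefore does not count as the end component in three-code-length coding.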

Appendix Table 3

Example of the component list in the Chinese character feature code key position information

Notes: 1. Blank cells in the list above are "0".

2. Generally 200 to 250 Chinese character components are selected, so each key position usually carries about 10 components.

3. The number of Chinese character GB codes in an actual key position table is very large; some key position tables contain several hundred or more (GB2312).

4. A complete feature code key position information list (JG table) should also include the pinyin list (placed on the Shift + character keys) and the stroke list (placed on the number keys 1 to 5). Users accustomed to pinyin input can switch so that pinyin is entered with the character keys and components with the Shift + character keys.

5. The feature code key position information list here shows only the component list part and is used only as an example in the text.

6. The feature code key position information list (JG table) is generated automatically by the system program from the feature code raw information list (YG table) according to the key position design. The JG table here is therefore, like the YG table, an open information structure.

7. Like the YG table, the JG table here has a code length of four. When it is used for three-code-length component coding (three keys + SP key), D2 is taken as the end key order of the Chinese character, and the condition for D2 = 1 is: D4D2 = 1 (three-component characters) or D4D3 = 1 (characters with four or more components).

8. The D0, D1 and D2 list information belongs to the first, second and third key orders of the Chinese character respectively.

The D3 list information belongs to the end key order of the Chinese character and marks the key order of the last component of a character with four or more components.

D4 marks the tail key order, i.e. the key order of the last component of any character, including single-component characters and characters with two, three or more components.

9. An open, integrated-information JG table is the information base for generating all kinds of character, word and sentence codes (random coding).
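The automatic generation of the JG table from the YG table (note 6) might look like the sketch below. It is a hedged illustration: the YG table is modelled as a mapping from component to the characters that contain it with their control codes, the key position design as a mapping from component to key, and control codes of components that share a key are merged with a bitwise OR, which is an assumption. Characters stand in for GB codes, and the sample control codes are made up.

```python
def build_jg(yg, layout):
    """Convert a raw information list (YG) into a key position list (JG).

    yg:     {component: {character: control code}}
    layout: {component: key}, the key position design
    Returns {key: {character: control code}}.
    """
    jg = {}
    for component, entries in yg.items():
        table = jg.setdefault(layout[component], {})
        for char, control in entries.items():
            # Assumption: marks from components on the same key are OR-ed,
            # so component marks become key-order marks.
            table[char] = table.get(char, 0) | control
    return jg

# Illustrative data only: D0 = first component, D1 = second, D4 = tail.
yg = {
    "丿": {"系": 0b00001, "自": 0b00001},
    "亻": {"化": 0b00001},
    "匕": {"化": 0b10010},
}
layout = {"丿": "C", "亻": "C", "匕": "B"}
jg = build_jg(yg, layout)
```

Because the conversion is mechanical, changing the key position design only requires editing `layout` and regenerating the table, which is consistent with the open structure described in the notes.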

Appendix Table 4

Example of the pinyin list in the feature code key position information

Notes: 1. Blank cells in the pinyin (initial and final) key position table above are "0".

2. In the control codes of the pinyin key position table, the D0 list information belongs to the initial of the Chinese character and the D1 list information belongs to its final.

3. The pinyin list of the feature code key position information is generated automatically by the system program from the pinyin (initial and final) lists according to the key position design.

4. The number of Chinese character GB codes in an actual key position table is very large; some key position tables contain several hundred or more (GB2312).

5. Users accustomed to pinyin input can switch so that pinyin information is entered with the character keys and component information with the Shift + character keys.

6. The feature code key position information list here shows only the pinyin information list part and is used only as an example in the text.

Appendix Table 5

Example of the stroke list in the Chinese character feature code key position information

Notes: 1. Blank cells in the two-stroke key position table above are "0".

2. Stroke codes are entered in the order of the first, second, third, fourth, fifth and last strokes; the five strokes (horizontal, vertical, left-falling, dot, hook) are entered with the number keys 1 to 5.

3. The stroke key position table is generally placed virtually on 25 character keys by stroke pairs, with two strokes combined into one code.

4. For Chinese characters with three or five strokes, the last stroke must carry an odd-position mark: D3 = 1 is marked on its compound-stroke key position (characters with other stroke counts need no mark).

Appendix Table 6

Example of automatic generation of Chinese character component codes

Notes: 1. Blank cells in the character-code generation buffer above are "0". The characters in parentheses are the key positions of the components into which the Chinese character is split.

2. The Chinese character GB codes in an actual character-code generation buffer should cover all level-1 and level-2 Chinese characters of GB2312.

3. The bold digits in the character GB units of the character-code generation buffer above form the detection data of the encoded Chinese character:

Detection data for two-code-length encoded characters: D0 and D1 are "1".

Detection data for three-code-length encoded characters: D0, D1 and D2 are "1".

Detection data for four-code-length encoded characters: D0, D1, D2 and D3 are "1".
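The detection data for the three code lengths follow one pattern: the lowest n bits of the character GB unit must all be "1". A minimal sketch, assuming the unit is an integer with D0 as the lowest bit (the function name is hypothetical):

```python
def is_encoded_character(unit, code_length):
    """True when bits D0 .. D(code_length - 1) of the unit are all 1."""
    if code_length not in (2, 3, 4):
        raise ValueError("code length must be 2, 3 or 4")
    wanted = (1 << code_length) - 1      # 0b11, 0b111 or 0b1111
    return (unit & wanted) == wanted
```

This is why the same four-code-length buffer can serve two-code and three-code input: only the number of bits tested changes.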

Appendix Table 7

Example of automatic generation of word codes from components

Notes: 1. Blank cells in the word-code generation buffer above are "0". The characters in parentheses are the key positions of the components into which the Chinese character is split.

2. The Chinese character GB codes in an actual word-code generation buffer should cover all level-1 and level-2 Chinese characters of GB2312.

3. The bold data bits in the GB units of the word buffer above form the word-code discrimination unit, which is used to discriminate word codes entered as component information:

Feature data for two-character words: 99 (i.e. bits D0, D3, D4 and D7 of the discrimination unit of the two-character word are "1")

Feature data for three-character words: 95 (i.e. bits D0, D2, D4 and D7 of the discrimination unit of the three-character word are "1")

Feature data for words of four or more characters: 55 (i.e. bits D0, D2, D4 and D6 of the discrimination unit of the multi-character word are "1")
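The correspondence between the feature data and the bit positions listed above can be checked by reading each value as a hexadecimal byte with D0 as the lowest-order bit. The short sketch below does this; the function name `set_bits` is hypothetical.

```python
def set_bits(feature_hex):
    """List the D-positions that are 1 in a two-digit hexadecimal value."""
    value = int(feature_hex, 16)
    return [f"D{i}" for i in range(8) if (value >> i) & 1]

# "99" is 1001 1001 in binary; "95" is 1001 0101; "55" is 0101 0101.
two_char = set_bits("99")
three_char = set_bits("95")
four_plus = set_bits("55")
```

Read this way, "99", "95" and "55" reproduce exactly the bit lists given in note 3.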

Appendix Table 8

Example of automatic generation of pinyin character and word codes

Example of automatic generation of stroke character and word codes

Notes: 1. Blank cells in the character and word code buffers above are "0". The characters in parentheses are the key positions of the Chinese character's initial or final, or of its two-stroke combination.

2. The Chinese character GB codes in an actual character or word code buffer should cover all level-1 and level-2 Chinese characters of GB2312.

3. The bold data bits in the GB units of the character and word buffers above are used for character-code detection and word-code discrimination:

Detection data for two-code-length encoded characters: D0 and D1 are "1"

Detection data for three-code-length encoded characters: D0, D1 and D2 are "1"

Detection data for four-code-length encoded characters: D0, D1, D2 and D3 are "1"

Appendix Table 9

Example of automatic generation of mixed-type character codes

Notes: 1. Blank cells in the character-code generation buffer above are "0". The characters in parentheses are the key positions of the input information.

3. Users accustomed to component input enter component information with the character keys and initial and final information with the SHIFT + character keys. Users accustomed to pinyin input enter initial and final information with the character keys and component information with the SHIFT + character keys. Stroke information is entered with the five number keys 1 to 5 (representing the five stroke classes horizontal, vertical, left-falling, dot and hook).

4. The bold data bits in the character GB units of the buffer above are the character-code detection data:

Appendix Table 10

Example of automatic generation of mixed-type word codes

Notes: 1. Blank cells in the word-code generation buffer above are "0". The characters in parentheses are the key positions of the coded Chinese character information.

3. The bold data bits in the GB units of the word buffer above form the word-code discrimination unit, which is used to discriminate word codes:

Feature data for two-character words: 99 (the first and second keys of the first character are of the same information type, and so are the first and second keys of the last character)

95 (the first and second keys of the first character are of different types; the first and second keys of the last character are of the same type)

59 (the first and second keys of the first character are of the same type; the first and second keys of the last character are of different types)

55 (the first and second keys of the first character are of different types, and so are the first and second keys of the last character)

Feature data for three-character words: 95 (the first and second keys of the last character are of the same type; independent of the type of information keyed in for the first and second characters)

55 (the first and second keys of the last character are of different types; independent of the type of information keyed in for the first and second characters)

Feature data for words of four or more characters: 55 (independent of the type of information keyed in).

Claims

1. A Chinese information digitization processing method, characterized in that: level-1 and level-2 GB Chinese characters are split according to the corresponding decomposition standard, and the split Chinese character information is tabulated as feature code information; each Chinese character feature code in the feature code information list comprises two parts, the Chinese character national standard code (GB code) and a control code; the GB code is the Chinese character code, and the control code is the sequence code of the list information in the splitting of that Chinese character.

2. The Chinese information digitization processing method according to claim 1, characterized in that: the decomposition standard for level-1 and level-2 GB Chinese characters may be any of three: the "Chinese character component standard", the "Chinese character stroke standard" or the "Chinese Pinyin Scheme".

3. A Chinese character random coding method, characterized by comprising the following steps:

A) first, level-1 and level-2 GB Chinese characters are split according to the "Chinese character component standard", the "Chinese character stroke standard" or the "Chinese Pinyin Scheme", and the component, initial and final, and stroke information obtained is tabulated as a feature code information list comprising two parts, the Chinese character national standard code and the control code;

C) key positions are designed for the raw information;

D) the feature code raw information list is converted into the feature code key position information list;

E) character and word code buffers are set up;

F) the keyed-in information is stored in the character and word buffers;

G) after the character (or word) end key, coded information is extracted from the character (or word) library into the character (or word) code buffer according to the character (or word) coding features;

H) whether the character (or word) meets the coding condition is determined according to the coding features (feature data) of the character (or word), the encoded character (or word) is detected, and the encoding process ends.

4. The Chinese character random coding method according to claim 3, characterized in that: in the pinyin information obtained by splitting level-1 and level-2 GB Chinese characters according to the "Chinese Pinyin Scheme", the control code data of the initial list (21 initials) are "01" and the control code data of the final list (35 finals) are "02".

5. The Chinese character random coding method according to claim 3, characterized in that: the strokes obtained by splitting level-1 and level-2 GB Chinese characters according to the "Chinese character stroke standard" generally use the five strokes "horizontal, vertical, left-falling, dot (right-falling), hook", coded with the number keys 1 to 5; two-stroke combinations (5 × 5 = 25 groups) may also be arranged virtually on the character keys, operated with the number keys and tabulated by character key.

6. The Chinese character random coding method according to claim 3, characterized in that: the control code comprises a first component mark, a second component mark, a third component mark, an end component mark and a tail component mark.

7. The Chinese character random coding method according to claim 6, characterized in that: the digitized Chinese character component information list uses four-code-length key orders, which are compatible with two code lengths and three code lengths. Within the four-code-length key order, the computer takes codes from the keyed-in information by the following rule: the component that appears first is always treated as the first component, followed by the second component, the third component and the end component; the pinyin information that appears first among the four keys is treated as the initial, followed by the final; stroke information appearing among the four keys is taken in order as the first and second strokes, the third and fourth strokes, and the fifth or last stroke.

8. The Chinese character random coding method according to claim 6, characterized in that: within the four-code-length key order, for both character codes and word codes, the information type of the input may be changed at will without any switching.

9. The Chinese character random coding method according to claim 6, characterized in that: for character codes, whether same-type or different-type information is entered, the code length of the character code (two, three or four codes) may be changed at will.

10. The Chinese character random coding method according to claim 6, characterized in that: in the random coding system, character and word code buffers must be set up, each with a capacity covering the entire character set, to store the key position information of the input, and the coding generation condition of a character or word is determined in these buffers.