CN104391589B

CN104391589B - Method for recognizing mixed Chinese and English input content based on keystroke records

Info

Publication number: CN104391589B
Application number: CN201410764964.3A
Authority: CN
Inventors: 宋胜利; 高海昌; 覃桂敏; 褚华
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2014-12-11
Filing date: 2014-12-11
Publication date: 2018-09-28
Anticipated expiration: 2034-12-11
Also published as: CN104391589A

Abstract

The invention discloses a method for recognizing mixed Chinese and English input content based on keystroke records. The method comprises the following steps: converting the user's keyboard messages into keyboard sequences; reading the record file and converting the current input state into a standard translation format; reading the standard translation format and looking it up in the user-specific restoration dictionary and the standard restoration dictionary, or substituting a best candidate when no match is found, to complete the restoration step; and displaying the translation result so that it can be corrected. The invention associates the user's keyboard messages and keyboard sequences with the message window that received them, and identifies the current input state with a window-based automatic state recognition algorithm, which improves recognition accuracy. For keyboard sequences that the user has partly omitted, the dictionary is searched and a best candidate is selected as a substitute. For a single keyboard sequence string, a lookup algorithm based on reverse dictionary entries improves lookup efficiency. The invention also provides a user correction interface through which the user can manually correct the restored file.

Description

Method for recognizing mixed Chinese and English input content based on keystroke records

Technical field

The present invention relates to the field of computers, and specifically to a method for recognizing mixed Chinese and English input content based on keystroke records.

Background art

With the development of informatization, computers have penetrated every aspect of people's lives, and keyboard input, as the main mode of interaction, plays an important role in online communication and routine office work. However, there is still no effective solution to the problem of restoring a user's original input from keyboard messages. The field of input restoration currently has no mature technical solution, and the main problems are as follows:

First, the user may switch windows at any time during input, and window switching is relatively frequent; ordinary methods cannot associate input messages with the window that received them.

Second, because of the auto-completion function of input methods, the user may omit part of a keyboard sequence during input, typically the latter half of a Pinyin syllable. This causes problems during restoration, because the correct result cannot be matched.

Third, determining the input method state is difficult. Many input methods are on the market, the keys for switching between input methods differ, and the switching mechanisms inside each input method also differ, so the input method state is determined inaccurately when the user's input is restored.

Finally, the accuracy of the restored result is not high, and homophones written with different characters occur with relatively high probability.

Solving the above problems, directly or indirectly, would be a major breakthrough in the field of keyboard input restoration.

Summary of the invention

The object of the present invention is to provide a method for recognizing mixed Chinese and English input content based on keystroke records that offers high lookup efficiency and high recognition accuracy, so as to solve the problems described in the background above.

To achieve the above object, the present invention provides the following technical solution:

A method for recognizing mixed Chinese and English input content based on keystroke records, comprising the following steps:

(1) While the user is typing, the user's keyboard messages are converted into keyboard sequences, noise information is removed from the keyboard sequences, and the user's keyboard sequences are merged according to the number of the window input box and persisted;

(2) The record file is read, the current input state is identified using the window-based automatic state recognition algorithm, and the recognition result is converted into the standard translation format;

(3) The standard translation format is read. The user-specific restoration dictionary is searched first, and then the standard restoration dictionary; the character string in each standard-format unit is looked up with the lookup algorithm based on reverse dictionary entries to obtain a translation result. For a string with no match, a best candidate is found in the dictionary as a substitute, completing the restoration step;

(4) The translation result is shown to the user, who uses the user correction interface to fix incorrect translations and homophone errors. These corrections are added to the user-specific restoration dictionary, and the final result is saved.
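Step (1) can be illustrated with a minimal Python sketch. It assumes that each captured keyboard message is already available as a (window number, key) pair and that "noise" means a small set of modifier keys; both the event shape and the `NOISE_KEYS` set are hypothetical, since the text does not define them.

```python
from collections import defaultdict

# Hypothetical noise set: the text does not say which keys count as noise.
NOISE_KEYS = {"Ctrl", "Alt", "Win"}


def merge_by_window(events):
    """Group key events by window number, dropping noise keys.

    events: iterable of (window_id, key) pairs in the order they were typed.
    Returns a dict mapping window_id to the merged keyboard sequence.
    """
    sequences = defaultdict(list)
    for window_id, key in events:
        if key in NOISE_KEYS:
            continue  # remove noise information
        sequences[window_id].append(key)
    return {window_id: "".join(keys) for window_id, keys in sequences.items()}


# The user alternates between windows 1 and 2 while typing.
events = [(1, "n"), (2, "h"), (1, "i"), (1, "Ctrl"), (2, "i")]
merged = merge_by_window(events)
```

Keying the merge on the window number is what keeps interleaved input from different windows apart; persisting `merged` to a record file is left out of the sketch.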

Compared with the prior art, the beneficial effects of the present invention are as follows:

The invention associates the user's keyboard messages and keyboard sequences with the message window that received them, and identifies the current input state with a window-based automatic state recognition algorithm, which improves recognition accuracy. For keyboard sequences that the user has partly omitted, the dictionary is searched and a best candidate is selected as a substitute. For a single keyboard sequence string, the lookup algorithm based on reverse dictionary entries improves lookup efficiency. The invention also provides a user correction interface through which the user can manually correct the restored file.

Description of the drawings

Fig. 1 is the flow chart of the present invention.

Fig. 2 is the flow chart that Keyboard Message is converted into keyboard sequence in the present invention.

Fig. 3 is the principle schematic of the lookup algorithm based on reverse dictionary entry in the present invention.

Detailed description of the embodiments

The technical solution of this patent is described in more detail below with reference to specific embodiments.

Referring to Figs. 1-3, a method for recognizing mixed Chinese and English input content based on keystroke records comprises the following specific steps:

The standard translation format G = WQ consists of a window number W and an input sequence Q, where: W is a window number that identifies the keyboard input sequence belonging to a single window, so that each input can be attributed to the correct window even when windows are switched frequently; Q is the input sequence on the window identified by window number W, Q = T1, T2, T3, ..., a sequence made up of at least one input unit.

Each input unit consists of an input state, a character string and a separator, i.e. T = [State] S [Separator], where: T denotes an input unit, [State] denotes the input state of input unit T, S denotes a character string, and [Separator] denotes a separator.

The input state [State] ∈ {P, E, W}, where: P denotes a Pinyin input method, E denotes an English input method, and W denotes the Wubi (five-stroke) input method.

The character string S[i] ∈ {0-9, a-z, A-Z}: each character S[i] of the string is a digit, an uppercase letter or a lowercase letter.

The separator [Separator] ∈ {carriage return, line feed, space, Shift, Tab, Caps Lock, Esc, punctuation mark}. The separator [Separator] separates the user's inputs from one another, and each input unit has exactly one input state.
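The standard translation format described above can be modeled as a small data structure. The following Python sketch is illustrative only: the class names `InputUnit` and `Record` and the textual rendering with square brackets and a colon are assumptions, since the text defines the components of G but not a concrete serialization.

```python
from dataclasses import dataclass
import string

STATES = {"P", "E", "W"}  # Pinyin, English, Wubi
ALLOWED_CHARS = set(string.ascii_letters + string.digits)


@dataclass(frozen=True)
class InputUnit:
    """One input unit T = [State] S [Separator]."""
    state: str
    text: str
    separator: str

    def __post_init__(self):
        if self.state not in STATES:
            raise ValueError(f"unknown input state: {self.state!r}")
        if not set(self.text) <= ALLOWED_CHARS:
            raise ValueError(f"character outside 0-9, a-z, A-Z in {self.text!r}")

    def render(self) -> str:
        # Hypothetical text form; the source does not fix one.
        return f"[{self.state}]{self.text}[{self.separator}]"


@dataclass(frozen=True)
class Record:
    """Standard translation format G = W Q for one window."""
    window: int
    units: tuple

    def render(self) -> str:
        return f"{self.window}:" + "".join(u.render() for u in self.units)


g = Record(3, (InputUnit("P", "nihao", "Space"), InputUnit("E", "hello", "Enter")))
rendered = g.render()
```

Validating the state and character set at construction time mirrors the constraints [State] ∈ {P, E, W} and S[i] ∈ {0-9, a-z, A-Z}.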

The principle of the window-based automatic state recognition algorithm is as follows:

In step (2), the record file is read first, and its contents are converted into the standard translation format:

$G = W\,T_1 T_2 T_3 \ldots T_n$

The input state $T_i$[State] in the standard translation format is not known in advance. In ordinary typing, every input is made under some input method state, but during recognition the current input method state cannot be determined directly, because it cannot be captured while the user is typing. Let $P_{i,x}$ denote the probability that the state of the i-th input unit is x, where x takes values in {P, E, W}. The input method state of each input unit may depend on the states of the preceding n-1 input units, and the influence factor differs with the distance between two input units; let $R_{m,i}$ denote the influence factor of the state of the m-th input unit on the state of the i-th input unit being x. In addition, the input can be matched against the user's dictionary; let $D_{i,x}$ denote the probability, obtained from the dictionary match, that the state of the i-th input unit is x, where x takes values in {P, E, W}.

Let α denote the influence factor of the preceding i-1 input units on the state of the current input unit i; then 1-α denotes the influence factor of the dictionary on the state of the current input unit i, so that $P_{i,x} = \alpha F_{i,x} + (1-\alpha) D_{i,x}$, where $F_{i,x}$ denotes the influence value of the states of the preceding i-1 input units on the state of the i-th input unit being x.

Then $P_{i,P}$ is the probability that the state is the Pinyin input method, $P_{i,E}$ is the probability that the state is the English input method, and $P_{i,W}$ is the probability that the state is the Wubi input method. The input state of the i-th input unit is the value of x for which $P_{i,x}$ is largest.
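The decision rule above (weight the context score F and the dictionary score D by α and 1-α, then take the most probable state) can be sketched in Python as follows. The function names and the example scores are made up for illustration; how F and D are obtained is outside this sketch.

```python
STATES = ("P", "E", "W")  # Pinyin, English, Wubi


def combine(f, d, alpha):
    """Return P[x] = alpha * F[x] + (1 - alpha) * D[x] for every state x."""
    return {x: alpha * f[x] + (1 - alpha) * d[x] for x in STATES}


def pick_state(f, d, alpha):
    """Return the state x with the largest combined probability."""
    p = combine(f, d, alpha)
    return max(STATES, key=lambda x: p[x])


# Illustrative scores: context favors Pinyin, the dictionary favors English.
f_scores = {"P": 0.6, "E": 0.3, "W": 0.1}
d_scores = {"P": 0.2, "E": 0.7, "W": 0.1}

balanced = pick_state(f_scores, d_scores, alpha=0.5)
context_heavy = pick_state(f_scores, d_scores, alpha=0.9)
```

With equal weights the dictionary evidence wins and the unit is labeled English; with α = 0.9 the preceding units dominate and the unit is labeled Pinyin, which shows how α trades context against dictionary evidence.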

When determining the input method state, $R_{m,i}$ denotes the influence value of the state of the m-th input unit on the state of the i-th input unit being x. This influence value essentially reflects the influence between the states of two input units that are i-m positions apart. A window W (a window length, distinct from the window number W above) can therefore be defined such that, for the i-th input unit, only the states of the input units within the preceding W positions are significant. This reduces the number of parameters while still achieving the desired effect. After this improvement:

$P_{i,x} = \alpha F_{i,x} + (1-\alpha) D_{i,x}$

$R_{l,y,x}$ denotes the influence factor, for two input units at distance l, of the earlier input unit having state y on the later input unit having state x. $F_{i,x}$ therefore depends only on the states within the preceding window of length W; other states need not be considered.
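The text does not give the formula that turns the factors R into the influence value F. The Python sketch below therefore makes an explicit assumption: F for state x is taken to be the sum of R over the previous units that fall inside the window, one term per distance l. Treat the summation, the function name and the sample factors as hypothetical.

```python
def windowed_influence(prev_states, r, window, x):
    """Assumed form: F = sum of R[(l, y, x)] over distances l = 1..window.

    prev_states: states of the units before the current one, oldest first.
    r: dict mapping (distance, earlier_state, later_state) to a factor.
    window: how many preceding units are considered (the window length W).
    """
    total = 0.0
    for distance in range(1, min(window, len(prev_states)) + 1):
        earlier_state = prev_states[-distance]  # unit that is `distance` back
        total += r.get((distance, earlier_state, x), 0.0)
    return total


# Illustrative factors only.
r_factors = {(1, "P", "P"): 0.5, (2, "E", "P"): 0.2, (3, "P", "P"): 0.1}
history = ["P", "E", "P"]

f_window_2 = windowed_influence(history, r_factors, window=2, x="P")
f_window_3 = windowed_influence(history, r_factors, window=3, x="P")
```

Because the factors are indexed by distance rather than by absolute position, the number of parameters is bounded by the window length times the number of state pairs, which is the saving the paragraph above describes.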

The user-specific restoration dictionary and the standard restoration dictionary are proposed because of the input habits that an input method learns while the user enters information. A user typically uses one particular brand of input method; over long-term use the input method records the user's habits and, in later input, matches input according to those habits. The standard restoration dictionary is the system dictionary; like the standard dictionary of an input method, it reflects the input method's standard vocabulary. The user-specific restoration dictionary is built from the user's corrections in the user correction module of step (4): each correction is added to it dynamically, and later restoration passes can use it.
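The interplay of the two dictionaries can be sketched in a few lines of Python. Plain dicts stand in for both dictionaries, and the function names are hypothetical; the sketch only shows the lookup order and how a correction feeds back into the user-specific dictionary.

```python
def restore(key_sequence, user_dict, standard_dict):
    """Look in the user-specific dictionary first, then the standard one."""
    if key_sequence in user_dict:
        return user_dict[key_sequence]
    return standard_dict.get(key_sequence)  # None when nothing matches


def record_correction(user_dict, key_sequence, corrected_text):
    """Store a manual correction so later restorations reuse it."""
    user_dict[key_sequence] = corrected_text


standard_dict = {"shi": "是"}
user_dict = {}

before = restore("shi", user_dict, standard_dict)
record_correction(user_dict, "shi", "事")  # user fixes a homophone
after = restore("shi", user_dict, standard_dict)
```

Searching the user-specific dictionary first means that a homophone the user has corrected once takes precedence over the standard entry from then on.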

The lookup algorithm based on reverse dictionary entries: the in-memory data layout is the way the computer organizes dictionary entries efficiently in memory. Each dictionary entry has the following structure:

Item = { message[], results[], result_length, pointers[], pointer_length }, where message is the keyboard sequence of the dictionary entry, stored in reverse order; results holds the dictionary results that the sequence matches; result_length is the length of results; pointers holds the local (partial) index; and pointer_length is the length of the local index.

results[] = { vector<result> }

pointers[] = { vector<pointer> }
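As a rough Python rendering of this entry structure, the sketch below keeps the field names from the text and derives the two length fields from the lists instead of storing them separately. The helper `make_item` is hypothetical; it only shows the reversal of the keyboard sequence.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One dictionary entry; message holds the key sequence reversed."""
    message: str
    results: list = field(default_factory=list)
    pointers: list = field(default_factory=list)

    @property
    def result_length(self) -> int:
        return len(self.results)

    @property
    def pointer_length(self) -> int:
        return len(self.pointers)


def make_item(key_sequence, results):
    """Build an entry, storing the keyboard sequence in reverse order."""
    return Item(message=key_sequence[::-1], results=list(results))


item = make_item("nihao", ["你好"])
```

In a language with fixed-size arrays the explicit result_length and pointer_length fields would be needed; in Python they follow from the lists.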

An index is an in-memory data structure that speeds up access to the data the user needs. The index in the present invention comprises a global index global_index[], based on input units, and a local in-memory index based on the character information of each entry. The global index global_index[] records the true in-memory address real_addr of dictionary entries according to whether their first two characters are the same, and a hash function identifies the indirect index memory address ind_addr; each indirect index memory address ind_addr stores a true address real_addr.

The hash function is: h(k) = (int)(k - 'a');

The indirect index memory address is given by: ind_addr = h(message[0]) * 26 + h(message[1])

The true address is given by: real_addr = global_index[ind_addr]
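The three formulas above translate directly into Python. In this sketch a list index stands in for a memory address, -1 marks an empty slot, and messages are assumed to consist of at least two lowercase letters, because h(k) = k - 'a' only maps a-z into 0-25; those three choices are assumptions of the sketch, not statements from the text.

```python
def h(k):
    """Hash of one lowercase letter: 'a' -> 0, ..., 'z' -> 25."""
    return ord(k) - ord("a")


def ind_addr(message):
    """Slot in the global index, determined by the first two characters."""
    return h(message[0]) * 26 + h(message[1])


def build_global_index(messages):
    """Map every two-letter prefix to the first entry that starts with it.

    messages: dictionary messages in storage order; the list position plays
    the role of real_addr. Slots with no entry hold -1.
    """
    global_index = [-1] * (26 * 26)
    for real_addr, message in enumerate(messages):
        slot = ind_addr(message)
        if global_index[slot] == -1:
            global_index[slot] = real_addr
    return global_index


messages = ["abab", "acab", "acad", "bcde"]
global_index = build_global_index(messages)
real_addr = global_index[ind_addr("acad")]
```

A lookup can start at `real_addr` instead of at the beginning of the dictionary, so entries whose first two characters differ are never visited.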

An ordinary dictionary search compares the entries in memory one by one, and some of these comparisons are useless. For example, if the entry being looked up is abab, the entry being compared is acab, and the next dictionary entry is acad, then acad need not be compared. The local index model based on entry character information therefore adds a pointers[] array to each dictionary entry to improve lookup efficiency: pointers[i] is the address of the next entry to examine when the first i-1 characters of check_item.message are the same as the first i-1 characters of index_item.message but the i-th characters differ.

Lookup algorithm based on reverse dictionary entry is as follows：

Input: the entry to look up, check_item

Output: the position pos of the entry in the dictionary
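The body of the algorithm is not reproduced in the text, so the following Python sketch is one possible reading of the pointers[] description rather than the patented procedure. It uses 0-based positions, list indices in place of addresses, and assumes the messages have already been reversed by the caller; `build_pointers` and `lookup` are hypothetical names.

```python
def build_pointers(messages):
    """pointers[j][i] = index of the next entry after j whose first i+1
    characters differ from those of entry j (len(messages) if none)."""
    n = len(messages)
    pointers = []
    for j, msg in enumerate(messages):
        row = []
        for i in range(len(msg)):
            k = j + 1
            while k < n and messages[k][: i + 1] == msg[: i + 1]:
                k += 1
            row.append(k)
        pointers.append(row)
    return pointers


def lookup(check, messages, pointers, start=0):
    """Return (pos, visited); pos is -1 when check is not in the dictionary."""
    pos, visited = start, 0
    while 0 <= pos < len(messages):
        visited += 1
        entry = messages[pos]
        if entry == check:
            return pos, visited
        # Find the first position where the two strings stop agreeing.
        i = 0
        while i < len(entry) and i < len(check) and entry[i] == check[i]:
            i += 1
        # Jump over every following entry that shares entry[: i + 1]; if the
        # entry itself is exhausted, just move to the next entry.
        pos = pointers[pos][i] if i < len(entry) else pos + 1
    return -1, visited


# The example from the text: looking up abab skips acad after acab fails.
messages = ["acab", "acad", "abab"]
pointers = build_pointers(messages)
found = lookup("abab", messages, pointers)
```

Here `found` reports that abab sits at position 2 and that only two entries were examined: after the mismatch with acab at the second character, the pointer jumps past acad, which shares the prefix "ac" and so cannot match either.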

In step (3), the lookup algorithm based on reverse dictionary entries is used for matching in the dictionary. If no match is found, a method based on natural language processing is applied: part-of-speech analysis, syntactic analysis and sentence-level semantic analysis of the preceding input units are used to select a best candidate. This solves the problem that the user, relying on the input method's auto-completion to suggest the intended word, omits some characters.
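The natural language analysis itself is not specified in enough detail to reproduce. As a deliberately simpler stand-in, the Python sketch below picks the most frequent dictionary entry whose key sequence begins with what the user typed; the frequency table and the function name are hypothetical, and prefix-plus-frequency ranking is a substitute technique, not the part-of-speech and syntactic analysis the text describes.

```python
def best_candidate(typed, dictionary, frequency):
    """Return the result of the most frequent entry that extends `typed`.

    dictionary: maps full key sequences to restored text.
    frequency: maps full key sequences to a usage count (hypothetical data).
    Returns None when no entry starts with the typed sequence.
    """
    candidates = [key for key in dictionary if key.startswith(typed)]
    if not candidates:
        return None
    best_key = max(candidates, key=lambda key: frequency.get(key, 0))
    return dictionary[best_key]


dictionary = {"nihao": "你好", "nihong": "霓虹"}
frequency = {"nihao": 10, "nihong": 2}

# The user typed only "nih" and let the input method complete the rest.
restored = best_candidate("nih", dictionary, frequency)
```

A context-aware ranker could replace the `frequency.get` key function without changing the rest of the fallback.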

The preferred embodiments of this patent have been described in detail above, but this patent is not limited to these embodiments; within the knowledge of a person of ordinary skill in the art, various changes may be made without departing from the spirit of this patent.

Claims

1. A method for recognizing mixed Chinese and English input content based on keystroke records, characterized in that it comprises the following steps:

(1) while the user is typing, the user's keyboard messages are converted into keyboard sequences, noise information is removed from the keyboard sequences, and the user's keyboard sequences are merged according to the number of the window input box and persisted;

(2) the record file is read, the current input state is identified using the window-based automatic state recognition algorithm, and the recognition result is converted into the standard translation format;

(3) the standard translation format is read; the user-specific restoration dictionary is searched first, and then the standard restoration dictionary; in the user-specific restoration dictionary and the standard restoration dictionary, the character string in each standard translation format unit is looked up with the lookup algorithm based on reverse dictionary entries to obtain a translation result; for a string with no match, a best candidate is found in the user-specific restoration dictionary and the standard restoration dictionary as a substitute, completing the restoration step;

(4) the translation result is shown to the user, who uses the user correction interface to fix incorrect translations and homophone errors; these corrections are added to the user-specific restoration dictionary, and the final result is saved.