
CN104391589B - Method for recognizing mixed Chinese and English input content based on keystroke records - Google Patents

Method for recognizing mixed Chinese and English input content based on keystroke records

Info

Publication number
CN104391589B
CN104391589B CN201410764964.3A CN201410764964A CN104391589B CN 104391589 B CN104391589 B CN 104391589B CN 201410764964 A CN201410764964 A CN 201410764964A CN 104391589 B CN104391589 B CN 104391589B
Authority
CN
China
Prior art keywords
user
dictionary
standard
keyboard
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410764964.3A
Other languages
Chinese (zh)
Other versions
CN104391589A (en)
Inventor
宋胜利
高海昌
覃桂敏
褚华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN201410764964.3A
Publication of CN104391589A
Application granted
Publication of CN104391589B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/02 Input arrangements using manually operated switches, e.g. using keyboards or dials
    • G06F3/023 Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
    • G06F3/0233 Character input methods
    • G06F3/0237 Character input methods using prediction or retrieval techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for recognizing mixed Chinese and English input content based on keystroke records. The specific steps are: converting the user's keyboard messages into keyboard sequences; reading the record file and converting the current input state into the standard translation format; reading the standard translation format and looking up each string in the user-specific restoration dictionary and the standard restoration dictionary, or substituting a best candidate when no match is found, to complete the restoration step; and displaying the translation result so that it can be modified. The invention associates the user's keyboard messages with the message window that received them and standardizes the user's keyboard sequences, and it identifies the current input state with a window-based automatic state identification algorithm, which improves recognition accuracy. Keyboard sequences that the user omitted are looked up in the dictionary and replaced by a best candidate. For a single keyboard sequence string, a lookup algorithm based on reverse dictionary entries improves lookup efficiency. The invention also provides a user correction interface through which the user can manually correct the restored file.

Description

Method for recognizing mixed Chinese and English input content based on keystroke records
Technical field
The present invention relates to the computer field, and specifically to a method for recognizing mixed Chinese and English input content based on keystroke records.
Background art
With the development of informatization, computers have reached into every aspect of people's lives, and keyboard input, as the main mode of interaction, plays an important role in Internet communication and routine office work. However, there is still no effective solution to the problem of restoring a user's original input from keyboard messages. The field of input restoration currently has no mature technical solution, and the main problems are as follows:
First, the user may switch windows at any time while typing, and does so fairly frequently; ordinary methods cannot associate the input messages with the window that received them.
Second, because of the auto-completion function of input methods, the user may omit part of a keyboard sequence while typing, typically the latter half of a pinyin string. This causes problems during restoration, because the correct result can no longer be matched.
Third, the input method state is difficult to determine. There are many input methods on the market, the keys for switching between input methods differ, and the switching mechanisms inside each input method also differ, so the input method state is determined inaccurately when the user's input is restored.
Finally, the accuracy of the restored result is low, and homophones (characters with the same pronunciation but different written forms) appear with relatively high probability.
If these problems could be solved, directly or indirectly, it would be a major breakthrough in the field of keyboard input restoration.
Summary of the invention
The purpose of the present invention is to provide a method for recognizing mixed Chinese and English input content based on keystroke records that has high search efficiency and high recognition accuracy, so as to solve the problems raised in the background art above.
To achieve this purpose, the present invention provides the following technical solution:
A method for recognizing mixed Chinese and English input content based on keystroke records, with the following specific steps:
(1) During the user's input, the user's keyboard messages are converted into a keyboard sequence, noise information is removed from the keyboard sequence, and the user's keyboard sequences are merged according to the number of the window input box and then persisted;
(2) The record file is read, the current input state is identified with the window-based automatic state identification algorithm, and the recognition result is converted into the standard translation format;
(3) The standard translation format is read. The lookup first uses the user-specific restoration dictionary and then the standard restoration dictionary; the lookup algorithm based on reverse dictionary entries is applied to the character string in each standard-format unit to obtain the translation result; for strings with no match, a best candidate is found in the dictionary as a substitute, which completes the restoration step;
(4) The translation result is shown to the user, who makes changes through the user correction interface, correcting incorrectly translated results and homophones; these changes are added to the user-specific restoration dictionary, and the final result is saved.
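Step (1) can be illustrated with a minimal Python sketch. It assumes that each captured keyboard message has already been reduced to a pair of window number and key name; the names `KeyEvent`, `NOISE_KEYS`, and `merge_by_window` are hypothetical, and the set of keys treated as noise is an assumption, since the patent does not enumerate them.

```python
from collections import OrderedDict
from dataclasses import dataclass

# Assumption: keys treated as "noise". The patent does not list them.
NOISE_KEYS = {"Ctrl", "Alt", "Win"}


@dataclass
class KeyEvent:
    window_id: int  # number of the window input box that received the key
    key: str        # key name, e.g. "n", "Space", "Ctrl"


def merge_by_window(events):
    """Drop noise keys and merge the remaining keys per window, in typing order."""
    merged = OrderedDict()
    for ev in events:
        if ev.key in NOISE_KEYS:
            continue
        merged.setdefault(ev.window_id, []).append(ev.key)
    return merged


# The user types in window 1, switches to window 2, then comes back.
events = [
    KeyEvent(1, "n"), KeyEvent(1, "i"),
    KeyEvent(2, "h"), KeyEvent(1, "Ctrl"),
    KeyEvent(2, "i"), KeyEvent(1, "Space"),
]
merged = merge_by_window(events)
```

Keeping one ordered list per window is what lets interleaved typing across frequently switched windows be reassembled; persisting `merged` (for example as a record file) is left out of the sketch.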
Compared with the prior art, the beneficial effects of the present invention are:
The invention associates the user's keyboard messages with the message window that received them and standardizes the user's keyboard sequences, and it identifies the current input state with a window-based automatic state identification algorithm, which improves recognition accuracy. Keyboard sequences that the user omitted are looked up in the dictionary and replaced by a best candidate. For a single keyboard sequence string, a lookup algorithm based on reverse dictionary entries improves lookup efficiency. The invention also provides a user correction interface through which the user can manually correct the restored file.
Description of the drawings
Fig. 1 is the flow chart of the present invention.
Fig. 2 is a flow chart of converting keyboard messages into keyboard sequences in the present invention.
Fig. 3 is the principle schematic of the lookup algorithm based on reverse dictionary entry in the present invention.
Specific embodiments
The technical solution of this patent is described in more detail below with reference to a specific embodiment.
Referring to Figs. 1-3, a method for recognizing mixed Chinese and English input content based on keystroke records has the following specific steps:
(1) During the user's input, the user's keyboard messages are converted into a keyboard sequence, noise information is removed from the keyboard sequence, and the user's keyboard sequences are merged according to the number of the window input box and then persisted;
(2) The record file is read, the current input state is identified with the window-based automatic state identification algorithm, and the recognition result is converted into the standard translation format;
The standard translation format G = WQ consists of a window number W and an input sequence Q, where: W is a window number that identifies the keyboard input sequence typed under one window, so that each input can be matched to the right window even when windows are switched frequently; Q is the input sequence on the window identified by W, Q = T1, T2, T3, ..., a sequence made up of at least one input unit.
Each input unit consists of an input state, a character string, and a separator, i.e. T = [State] S [Separator], where T denotes an input unit, [State] the input state of this input unit, S a character string, and [Separator] a separator.
Input state: [State] ∈ {P, E, W}, where P denotes the pinyin input method, E the English input method, and W the Wubi (five-stroke) input method.
Character string: S[i] ∈ {0-9, a-z, A-Z}; each character S[i] of the string S is a digit, an uppercase letter, or a lowercase letter.
Separator: [Separator] ∈ {carriage return, line feed, space, Shift, Tab, Caps Lock, Esc, punctuation mark}. The separator splits the user's input into units, and each input unit has exactly one input state.
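The standard translation format described above can be sketched as a small data model. This is only an illustration under stated assumptions: the class names `InputUnit` and `StandardFormat`, the function `split_units`, the "?" placeholder for a state that has not been identified yet, and the rule that every key other than a single ASCII letter or digit acts as a separator are all hypothetical choices, not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InputUnit:
    state: str      # "P", "E" or "W"; "?" (assumption) while still undetermined
    text: str       # character string S over 0-9, a-z, A-Z
    separator: str  # separator that ended the unit, e.g. "Space" or ","


@dataclass
class StandardFormat:
    window: int                                           # window number W
    units: List[InputUnit] = field(default_factory=list)  # input sequence Q


def split_units(window, keys):
    """Split one window's key sequence into input units at separators."""
    units, buf = [], []
    for key in keys:
        if len(key) == 1 and key.isascii() and key.isalnum():
            buf.append(key)
        else:
            # Assumption: named keys and punctuation marks all act as separators.
            units.append(InputUnit("?", "".join(buf), key))
            buf = []
    if buf:  # trailing characters that were never terminated by a separator
        units.append(InputUnit("?", "".join(buf), ""))
    return StandardFormat(window, units)


g = split_units(3, ["n", "i", "h", "a", "o", "Space", "o", "k", "Enter"])
```

Here `g` holds window number 3 and two input units, `nihao` ended by `Space` and `ok` ended by `Enter`; the state of each unit is filled in later by the state identification step.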
The principle of the window-based automatic state identification algorithm is as follows:
In step (2), the record file is read first and converted into the standard translation format:
$G = W\,T_1 T_2 T_3 \ldots T_n$
The input state $T_i[\mathrm{State}]$ of each unit in the standard translation format is initially undetermined. In ordinary typing, every input is made under some input method state, but that state cannot be captured while the user types, so it cannot be read off directly during recognition and has to be inferred. Let $P_{(i,x)}$ denote the probability that the state of the $i$-th input unit is $x$, where $x \in \{P, E, W\}$. The input method state of each input unit may depend on the states of the preceding input units, and the influence factor differs with the distance between two input units; let $R_{(m,i)}$ denote the influence factor of the state of the $m$-th input unit on the state of the $i$-th input unit being $x$. In addition, the input can be matched against the user dictionary; $D_{(i,x)}$ denotes the probability, according to this matching result, that the state of the $i$-th input unit is $x$, where $x \in \{P, E, W\}$.
Let $\alpha$ denote the influence factor of the preceding $i-1$ input units on the state of the current input unit $i$, and $1-\alpha$ the influence factor of the dictionary on the state of the current input unit $i$. Then:
$P_{(i,x)} = \alpha F_{(i,x)} + (1-\alpha) D_{(i,x)}$
where $F_{(i,x)}$ denotes the influence value of the states of the preceding $i-1$ input units on the state of the $i$-th input unit being $x$.
$P_{(i,x)}\big|_{x=P}$ is the probability that the state is the pinyin input method, $P_{(i,x)}\big|_{x=E}$ the probability that the state is the English input method, and $P_{(i,x)}\big|_{x=W}$ the probability that the state is the Wubi input method. The input state of the $i$-th input unit is the $x$ with the largest of these three values:
$\mathrm{State}_i = \arg\max_{x \in \{P, E, W\}} P_{(i,x)}$
When the input method state is judged, $R_{(m,i)}$ denotes the influence value of the state of the $m$-th input unit on the state of the $i$-th input unit being $x$; this value essentially reflects the influence between the states of two input units that are $i-m$ positions apart. A window of length $W$ can therefore be defined (a window length, not to be confused with the window number W or the Wubi state W): for the $i$-th input unit, only the states of the input units within the preceding $W$ positions are significant. This reduces the number of parameters while still achieving the desired effect. After this improvement:
$P_{(i,x)} = \alpha F_{(i,x)} + (1-\alpha) D_{(i,x)}$
$R_{(l,y,x)}$ denotes, for two input units at distance $l$ where the state of the earlier unit is $y$, the influence factor of the earlier unit being in state $y$ on the later unit being in state $x$. $F_{(i,x)}$ therefore depends only on the states inside the adjacent preceding window of length $W$; no other states need to be considered.
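The combination rule P(i,x) = α·F(i,x) + (1 − α)·D(i,x) and the final arg-max can be sketched as follows. The exact formula for F(i,x) is not reproduced in this text, so the sketch assumes one plausible form: the sum of R(l,y,x) over the units in the preceding window, normalised over x. The function names, the default values of `alpha` and `window`, and all numbers in the example are hypothetical.

```python
STATES = ("P", "E", "W")  # pinyin, English, Wubi


def influence(prev_states, R, window):
    """F(i, x) under an ASSUMED form: sum R[(l, y, x)] over the last `window`
    units (l = distance, y = state of the earlier unit), normalised over x."""
    recent = prev_states[-window:]
    raw = {}
    for x in STATES:
        total = 0.0
        for l, y in enumerate(reversed(recent), start=1):
            total += R.get((l, y, x), 0.0)
        raw[x] = total
    norm = sum(raw.values())
    if norm == 0:  # no usable history: fall back to a uniform value
        return {x: 1.0 / len(STATES) for x in STATES}
    return {x: v / norm for x, v in raw.items()}


def identify_state(prev_states, D, R, alpha=0.5, window=3):
    """Return the most probable state of the current unit and all P(i, x)."""
    F = influence(prev_states, R, window)
    P = {x: alpha * F[x] + (1 - alpha) * D.get(x, 0.0) for x in STATES}
    return max(STATES, key=lambda x: P[x]), P


# Hypothetical influence factors and dictionary-match probabilities.
R = {(1, "P", "P"): 0.8, (1, "P", "E"): 0.2,
     (2, "P", "P"): 0.6, (2, "P", "E"): 0.4}
D = {"P": 0.3, "E": 0.6, "W": 0.1}
state, probs = identify_state(["P", "P"], D, R, alpha=0.5, window=3)
```

In this example the dictionary match alone would favour English (0.6), but two preceding pinyin units pull the combined value for pinyin to 0.5 against 0.45 for English, so the unit is labelled P. How α and R are estimated is not specified here.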
(3) The standard translation format is read. The lookup first uses the user-specific restoration dictionary and then the standard restoration dictionary; the lookup algorithm based on reverse dictionary entries is applied to the character string in each standard-format unit to obtain the translation result; for strings with no match, a best candidate is found in the dictionary as a substitute, which completes the restoration step;
(4) The translation result is shown to the user, who makes changes through the user correction interface, correcting incorrectly translated results and homophones; these changes are added to the user-specific restoration dictionary, and the final result is saved.
The user-specific restoration dictionary and the standard restoration dictionary are proposed because of the way users' input habits form. A user typically types with one particular brand of input method; over long-term use the input method records the user's habits and, in later input, matches input according to those habits. The standard restoration dictionary is the system dictionary; it is similar to the standard dictionary of an input method and reflects the input method's standard lexicon. The user-specific restoration dictionary is built from the user's corrections in the correction module of step (4): every modification is dynamically added to the user-specific restoration dictionary, which can then be used in subsequent restoration.
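The lookup order and the feedback from corrections can be sketched in a few lines. Plain Python dictionaries stand in for both restoration dictionaries, and the function names `restore` and `apply_correction` are hypothetical; the pinyin string and the two homophones are only an example.

```python
def restore(key_string, user_dict, standard_dict):
    """Look up the user-specific dictionary first, then the standard one."""
    if key_string in user_dict:
        return user_dict[key_string]
    return standard_dict.get(key_string)  # None when neither dictionary matches


def apply_correction(key_string, corrected, user_dict):
    """Record a manual correction so that later restorations prefer it."""
    user_dict[key_string] = corrected


standard_dict = {"shiji": "世纪"}
user_dict = {}

first = restore("shiji", user_dict, standard_dict)   # standard result
apply_correction("shiji", "实际", user_dict)          # user fixes a homophone
second = restore("shiji", user_dict, standard_dict)  # user result now wins
```

Because the user-specific dictionary is consulted first, a homophone that the user corrected once is restored correctly from then on without changing the standard dictionary.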
The lookup algorithm based on reverse dictionary entries: the in-memory data structure is the way the computer organizes dictionary entries in memory. Each dictionary entry has the structure:
Item = { message[], results[], result_length, pointers[], pointer_length }
where message holds the keyboard sequence of the dictionary entry, stored in reverse order; results holds the dictionary results matched by that sequence; result_length is the length of results; pointers holds the local index; and pointer_length is the length of the local index.
results[] = { vector<result> }
pointers[] = { vector<pointer> }
An index is an in-memory data structure that gives faster access to the data the user needs. The index in the present invention comprises a global index based on input units, global_index[], and a local in-memory index based on the characters of each entry. The global index global_index[] records the real memory address real_addr of the dictionary entries according to their first two characters; a hash function then maps those two characters to an indirect index address ind_addr, and what is stored at each indirect index address ind_addr is the real address real_addr.
The hash function is: h(k) = (int)(k - 'a');
The indirect index address is: ind_addr = h(message[0]) * 26 + h(message[1])
The real address is: real_addr = global_index[ind_addr]
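The global index can be sketched directly from the three formulas above. The sketch assumes that every message consists of lowercase letters a-z and has at least two characters, that the "real address" is simply a position in a sorted Python list, and that -1 marks an empty slot; `build_global_index` and the sample keys are hypothetical.

```python
def h(k):
    """Hash of one lowercase letter: 'a' -> 0, ..., 'z' -> 25."""
    return ord(k) - ord("a")


def ind_addr(message):
    """Indirect index address computed from the first two characters."""
    return h(message[0]) * 26 + h(message[1])


def build_global_index(entries):
    """Map each two-character prefix to the position of its first entry.

    `entries` is a sorted list of reversed key strings; a list position
    stands in for the real memory address real_addr (assumption)."""
    global_index = [-1] * (26 * 26)
    for pos, message in enumerate(entries):
        slot = ind_addr(message)
        if global_index[slot] == -1:
            global_index[slot] = pos
    return global_index


# Keys are stored reversed, as in the Item structure: "nihao" -> "oahin".
entries = sorted(k[::-1] for k in ["nihao", "hao", "ni"])
global_index = build_global_index(entries)
real_addr = global_index[ind_addr("oahin")]
```

With 26 × 26 slots, a lookup jumps straight to the first entry that shares the key's first two (reversed) characters instead of scanning from the start of the dictionary.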
An ordinary dictionary lookup compares the search key against the entries in memory one by one, and some of these comparisons are useless. For example, suppose the entry being searched for is abab, the entry currently being compared is acab, and the next entry is acad. Since abab already differs from acab at the second character, the following entry acad, which shares the prefix ac, need not be compared at all. The local index model based on entry characters therefore adds a pointers[] array to each dictionary entry to improve search efficiency: pointers[i] is the address of the next entry to examine when the first i-1 characters of the search entry check_item.message are the same as the first i-1 characters of index_item.message and the i-th characters differ.
The lookup algorithm based on reverse dictionary entries is as follows:
Input: the search entry check_item
Output: its position pos in the dictionary
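The body of the algorithm is not reproduced in this text, so the following is only a sketch of how a search with per-entry skip pointers could work, under stated assumptions: entries are reversed key strings kept in a sorted list, list positions stand in for addresses, indices are 0-based (the description counts characters from 1), and the names `build_pointers` and `find` are hypothetical.

```python
def build_pointers(entries):
    """pointers[p][i]: position of the next entry to examine when the search
    key agrees with entries[p] on characters 0..i-1 and differs at character i.
    All following entries that share the prefix entries[p][:i+1] are skipped."""
    n = len(entries)
    pointers = []
    for p, entry in enumerate(entries):
        row = []
        for i in range(len(entry)):
            q = p + 1
            while q < n and entries[q][:i + 1] == entry[:i + 1]:
                q += 1
            row.append(q)
        pointers.append(row)
    return pointers


def find(entries, pointers, key, trace=None):
    """Return the position of `key` in `entries`, or -1 if it is absent."""
    pos = 0
    while pos < len(entries):
        if trace is not None:
            trace.append(pos)
        entry = entries[pos]
        if entry == key:
            return pos
        i = 0  # index of the first differing character
        while i < len(entry) and i < len(key) and entry[i] == key[i]:
            i += 1
        # If the entry is a proper prefix of the key, just move to the next one.
        pos = pointers[pos][i] if i < len(entry) else pos + 1
    return -1


# Reversed keys: "bada" is stored as "adab", and so on.
raw_keys = ["baaa", "baca", "daca", "abca", "bada"]
entries = sorted(k[::-1] for k in raw_keys)  # aaab, acab, acad, acba, adab
pointers = build_pointers(entries)

visited = []
pos = find(entries, pointers, "bada"[::-1], trace=visited)
```

Searching for `adab` visits only positions 0, 1, and 4: after the mismatch with `acab` at the second character, the pointer skips `acad` and `acba`, which share the prefix `ac`. In a fuller version the search would start at the position supplied by the global index rather than at 0.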
In step (3), matching is performed in the dictionary with the lookup algorithm based on reverse dictionary entries. If no match is found, a method based on natural language processing is used: part-of-speech analysis, syntactic analysis, and sentence-meaning analysis are applied to the preceding input units to select a best candidate. This addresses the problem that, because the input method's auto-completion suggests the word the user intends to type, the user omits some characters.
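The fallback for truncated input can be illustrated with a deliberately simplified stand-in. The sketch below does not implement the part-of-speech, syntactic, or sentence-meaning analysis described above; it replaces that analysis with prefix matching ranked by a hypothetical frequency table, purely to show where a best candidate enters the pipeline. The name `best_candidate` and all data are assumptions.

```python
def best_candidate(key_string, dictionary, frequency):
    """Pick a substitute for a truncated key string.

    Simplified stand-in (NOT the NLP analysis of the method): choose the most
    frequent dictionary key that starts with `key_string`, preferring shorter
    keys when frequencies tie."""
    matches = [k for k in dictionary if k.startswith(key_string)]
    if not matches:
        return None
    best = max(matches, key=lambda k: (frequency.get(k, 0), -len(k), k))
    return dictionary[best]


dictionary = {
    "zhongguo": "中国",
    "zhongguoren": "中国人",
    "zhongjian": "中间",
}
frequency = {"zhongguo": 50, "zhongguoren": 10, "zhongjian": 30}

# The user typed only "zhongg" and let auto-completion finish the word.
candidate = best_candidate("zhongg", dictionary, frequency)
```

A context-aware selection, as the method describes, would additionally weigh the preceding input units before choosing among the matching entries.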
The invention associates the user's keyboard messages with the message window that received them and standardizes the user's keyboard sequences, and it identifies the current input state with a window-based automatic state identification algorithm, which improves recognition accuracy. Keyboard sequences that the user omitted are looked up in the dictionary and replaced by a best candidate. For a single keyboard sequence string, a lookup algorithm based on reverse dictionary entries improves lookup efficiency. The invention also provides a user correction interface through which the user can manually correct the restored file.
A preferred embodiment of this patent has been explained in detail above, but this patent is not limited to that embodiment; within the knowledge of a person of ordinary skill in the relevant art, various changes can be made without departing from the purpose of this patent.

Claims (1)

1. A method for recognizing mixed Chinese and English input content based on keystroke records, characterized in that the specific steps are as follows:
(1) During the user's input, the user's keyboard messages are converted into a keyboard sequence, noise information is removed from the keyboard sequence, and the user's keyboard sequences are merged according to the number of the window input box and then persisted;
(2) The record file is read, the current input state is identified with the window-based automatic state identification algorithm, and the recognition result is converted into the standard translation format;
(3) The standard translation format is read. The lookup first uses the user-specific restoration dictionary and then the standard restoration dictionary; the lookup algorithm based on reverse dictionary entries is applied, in the user-specific restoration dictionary and the standard restoration dictionary, to the character string in each standard translation format to obtain the translation result; for strings with no match, a best candidate is found in the user-specific restoration dictionary and the standard restoration dictionary as a substitute, which completes the restoration step;
(4) The translation result is shown to the user, who makes changes through the user correction interface, correcting incorrectly translated results and homophones; these changes are added to the user-specific restoration dictionary, and the final result is saved.
CN201410764964.3A 2014-12-11 2014-12-11 Method for recognizing mixed Chinese and English input content based on keystroke records Expired - Fee Related CN104391589B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410764964.3A CN104391589B (en) 2014-12-11 2014-12-11 Method for recognizing mixed Chinese and English input content based on keystroke records

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410764964.3A CN104391589B (en) 2014-12-11 2014-12-11 Method for recognizing mixed Chinese and English input content based on keystroke records

Publications (2)

Publication Number Publication Date
CN104391589A CN104391589A (en) 2015-03-04
CN104391589B true CN104391589B (en) 2018-09-28

Family

ID=52609501

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410764964.3A Expired - Fee Related CN104391589B (en) 2014-12-11 2014-12-11 Method for recognizing mixed Chinese and English input content based on keystroke records

Country Status (1)

Country Link
CN (1) CN104391589B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388346A (en) * 2018-02-28 2018-08-10 山东师范大学 A kind of Intelligent input mechanism and input method based on ARM and camera

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1896923A (en) * 2005-06-13 2007-01-17 余可立 Method for inputting English Bashu railing Chinese morphology translation intermediate text by computer
CN101403947A (en) * 2008-11-19 2009-04-08 黄庆传 Computer word input method, its keyboard and mouse
CN103399766A (en) * 2013-07-29 2013-11-20 百度在线网络技术(北京)有限公司 Method and device for updating input method system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8713464B2 (en) * 2012-04-30 2014-04-29 Dov Nir Aides System and method for text input with a multi-touch screen

Also Published As

Publication number Publication date
CN104391589A (en) 2015-03-04

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180928

Termination date: 20181211