CN103885938A - Industry spelling mistake checking method based on user feedback - Google Patents
- Publication number
- CN103885938A (application CN201410149427.8A)
- Authority
- CN
- China
- Prior art keywords
- word
- corpus
- user
- industry
- dictionary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Machine Translation (AREA)
Abstract
The invention discloses an industry spelling mistake checking method based on user feedback. The method checks English text for spelling mistakes using an N-gram method and a user dictionary designed by category, and recommends correct words by searching a large corpus database, so that spell checking is tailored to the user. The N-gram method, a basic method of natural language processing, checks for mistakes in the text according to the characteristics of words or sentences and the statistical information in a corpus. The categorized user dictionary, drawing on the current user's historical information together with the statistical data of the corpus, selects the recommended words most relevant to the wrong words in the text the user has entered. The Viterbi algorithm is used to search the database for the word chain with the largest product of conditional probabilities, which improves the computational efficiency of the hidden Markov model on a large corpus and the efficiency with which the statistical information in the database is used.
Description
Technical field
The present invention is an English spelling error checking method that uses a corpus containing a large amount of linguistic information, a statistical natural language model, and the hidden Markov model (HMM). It relates to natural language processing, and in particular to the field of English spell checking.
Background art
First, the abbreviations used in the present invention are defined:
NLP: Natural Language Processing;
BNC: British National Corpus;
LDC: Linguistic Data Consortium;
LD: Levenshtein Distance (edit distance);
N-gram: a grammar over sequences of N consecutive words.
Spell checking is an important branch and a basic component of NLP. It turns natural language input into error-free, intelligible text and naturally supports higher-level NLP technologies such as machine translation, speech synthesis, and speech recognition. It can also make user interfaces friendlier and more intelligent, so it has significant practical value.
Early NLP mainly used methods based on syntactic and semantic rules. With the rise of corpus construction and corpus linguistics, processing large-scale real text became the main goal of natural language processing. After many years of development, rule-based methods still cannot overcome their limits in accuracy and efficiency, while statistical methods have gradually shown greater advantages. Natural language processing increasingly uses statistics-based automatic learning to acquire linguistic knowledge, and spell checking is no exception. Statistics-based methods mainly involve two things: corpora and statistical language models.
Several organizations and research institutions provide their own corpora and associated statistics, for example the Chinese and English news corpora used in text classification research, the BNC, the LDC, the more than 4,200 free e-books of Project Gutenberg, ten thousand randomly sampled papers from the DBLP resource, and the UCI classification evaluation data sets.
Brants and Franz of Google tokenized web page text in the manner of the Penn Treebank, producing more than 1T of data in total; the details are shown in Table 1. The 5-gram corpus that Google released, built from 1T of web page text, is currently a relatively comprehensive English corpus for statistical methods. It provides statistics for 1-grams through 5-grams and gives statistics-based natural language processing a rich source of data for analysis.
On the corpus side, the dictionary gives word correction its most basic capability: checking for non-word errors. An extensible standard dictionary with a good management interface provides users with basic word detection and improves system performance. A corpus that supports statistical methods is the basis of spell checking; it gives the natural language processing model usable data that is large in scale and rich and accurate in information. A semantics-based corpus is a good model for dividing professional domains, but because grammar rules are inefficient this approach is not practical. A statistical method is therefore needed to realize an industry-classified corpus indirectly.
Traditional spell checking focuses on non-word errors, in which a correct word is typed as an invalid word. The usual method is to use a reliable dictionary and a definite distance measure such as LD. Because building a reliable dictionary by hand is very costly, the dictionaries used by traditional spell checkers are fairly small. As statistical models were introduced into spell checking, the error model and the N-gram language model became the key components of a spell checking system. Kukich proposed a transition matrix of error probabilities and the use of feature vectors in spelling correction, which became the basis for later N-gram methods. Brill and Moore showed that a good statistical model is the key to improving spell checking accuracy, but building such an error model requires extensive manual annotation of correction pairs, which is expensive. Whitelaw et al. used Web text to reduce this cost to some extent. With the development of Web technology and applications, spell checking has received more and more attention, and more error types have been described, such as omitted letters, wrongly inserted letters, transposed letters, wrongly merged or split words, and misused words. These methods mainly address three problems: detecting input errors, searching the space of candidate words, and building a scoring function for candidate words.
Most existing spell checking models are offline models based on the N-gram model, and this approach has become the mainstream of spell checking research. The main idea is to use an extended Bayesian formula to compute statistical information in natural language; the key feature is the use of statistical methods, which keeps the model simple and efficient. Current research mainly uses the N-gram model, the extended Bayesian formula, and the hidden Markov model. The work divides into computing word probabilities with the Bayesian formula, using the hidden Markov model to solve for the parameters of the N-gram model and the Bayesian formula, and solving the hidden Markov model quickly. The efficiency and practicality of the model are problems this field urgently needs to solve.
Summary of the invention
Technical problem: In a spell checking system the corpus is the basis of the whole model, and computation and queries against it inevitably become the performance bottleneck of the whole system. If the corpus is based on grammar rules, or only counts word frequencies, queries easily suffer from poor performance caused by rule explosion or from inaccurate results caused by insufficient statistics. As for the spell checking model, one can simply match by some measure or use only an N-gram computation model; the former gives results with large errors, and the latter has a large impact on system performance. The technical problem the present invention addresses is that existing systems lack the ability to adjust dynamically based on user feedback and cannot effectively combine information from multiple corpora.
To address the inability to use multiple corpora effectively, the method combines a user dictionary, an industry corpus, and a core corpus through weighted calculation. The method offers fast queries, accurate results, and high adaptability to context. It automatically adjusts how the different corpora are used for different users and text environments, which improves system efficiency and safeguards the accuracy of the results.
The present invention uses the Viterbi algorithm to evaluate the Markov chain in the N-gram model and obtain the set of most probable correct words. In the corpus, the probability of each possible word is computed from the N-1 words preceding the wrong word, weights are computed from the LD measure and from the part of the corpus in which the word is found, and a recommendation list sorted by the probability of occurrence of the correct word is obtained. Based on the correct word chosen by the user and its context, the statistics of the user's text are entered into the system's corpora. After obtaining the new statistics, the system revises the word frequencies and conditional probabilities of the related records in the corpus data tables according to the statistical algorithm of the N-gram model. The corpus thus stays synchronized with the user's actual usage and records statistics for all historical text, which completes the overall update of the spell checking system.
Technical solution:
To solve the above technical problem, the present invention uses N-gram corpus data and related statistical methods and proposes an industry spelling mistake checking method based on user feedback. The method is as follows:
An industry spelling mistake checking method based on user feedback comprises the steps of:
1) Obtaining and building the corpora and the user dictionary:
The corpora are divided into a core corpus and an industry corpus. As the core statistical data in which language information is stored, they hold the lexical, syntactic, and semantic information of the language as a whole and of industry terminology. During spell checking, the core corpus and the industry corpus supply the spell checking model with all word and sentence information and provide global data for the whole language. Meanwhile, user-specific corpus information is obtained from the dictionary the user builds;
Data tables are defined in a database to store the global corpus and the user's corpus information;
2) Building the spell checking model:
The spell checking model is built by using an N-gram model to perform calculations on the statistical information of the corpora and obtain the word chain combination with the maximum conditional probability. The steps comprise:
21) Judging the correctness of a word: each word in the text is matched against the core corpus; if the word is not in the core corpus, the industry corpus and then the user dictionary are checked in turn; if the word is in none of these three kinds of data tables, it is judged to be a wrong word and the next step is carried out;
22) Recommending correct words: for the words in each corpus that are close to the wrong word in edit distance, the probability of each such word and its joint probability with the context are calculated; the correct words most relevant to the wrong word are then calculated using the weight of each corpus, and the several words with the largest weighted probability across all corpora are selected to form the recommendation list of correct words;
3) The text entered by the user is processed by the error checking and word recommendation of the spell checking model;
4) Updating the user-related text statistics, dictionary, and corpora: statistics are collected on the text entered by the user and the correct words selected, and the correct word information and context information of the text are entered into the user dictionary, the core corpus, and the corresponding industry corpus.
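The lookup order of step 21) can be sketched as follows. This is only an illustration: the function name `find_source`, the use of in-memory sets in place of database tables, and the lowercasing of words are assumptions, not details given in the patent.

```python
def find_source(word, core_vocab, industry_vocab, user_dict):
    """Return the name of the first word list that contains `word`, or None.

    The lookup order follows step 21): core corpus, then industry corpus,
    then user dictionary. A result of None marks the word as a wrong word.
    """
    key = word.lower()  # assumption: lookups are case-insensitive
    for name, vocab in (("core", core_vocab),
                        ("industry", industry_vocab),
                        ("user", user_dict)):
        if key in vocab:
            return name
    return None


core = {"the", "network", "is", "down"}
industry = {"router", "subnet"}
user = {"acmenet"}
print(find_source("Router", core, industry, user))   # found in the industry list
print(find_source("routr", core, industry, user))    # not found anywhere
```

A word for which the function returns `None` would then be passed on to step 22) for recommendation.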
In step 1), the necessary conditions for a valid corpus and user dictionary include:
(1) The user dictionary contains no wrong words: every entry is either a correct word taken from a recognized standard dictionary such as Oxford or Longman, or an industry or special word defined by the user;
(2) The core corpus is large enough, has no bias toward a particular industry or time period, and must contain N-gram information, which provides basic word context statistics;
(3) The industry corpus is initially divided according to requirements and then develops naturally according to users' choices; a single user can be a user of several industry corpora.
In step 21), the Viterbi algorithm is used within the N-gram model to quickly calculate the probability of the current word in the core corpus and the industry corpus and to obtain the joint probability of the current word and the preceding N-1 words, and thereby to judge whether the current word is correct.
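As a rough illustration of how the Viterbi algorithm selects the word chain with the largest product of conditional probabilities, the sketch below works on a bigram model (N=2) for brevity. The function `viterbi`, the candidate lattice, and the toy probabilities are hypothetical; the patent does not give its own implementation, and `bigram_prob` is assumed to return a smoothed, strictly positive value.

```python
import math


def viterbi(candidates, bigram_prob, start="#"):
    """Return the most probable word chain through a lattice of candidates.

    candidates  -- one non-empty list of candidate words per text position
    bigram_prob -- function (previous_word, word) -> P(word | previous_word),
                   assumed to be strictly positive
    """
    # best maps each word at the current position to
    # (log-probability of the best chain ending in that word, the chain itself)
    best = {start: (0.0, [])}
    for options in candidates:
        following = {}
        for word in options:
            score, chain = max(
                ((s + math.log(bigram_prob(prev, word)), c)
                 for prev, (s, c) in best.items()),
                key=lambda item: item[0],
            )
            following[word] = (score, chain + [word])
        best = following
    return max(best.values(), key=lambda item: item[0])[1]


toy = {("#", "the"): 0.5, ("the", "cat"): 0.4, ("the", "cut"): 0.1,
       ("cat", "sat"): 0.5, ("cut", "sat"): 0.05}


def toy_prob(prev, word):
    return toy.get((prev, word), 1e-6)  # tiny floor stands in for smoothing


print(viterbi([["the"], ["cut", "cat"], ["sat"]], toy_prob))
```

Log-probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow on long chains while selecting the same chain.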
In step 22), the N-gram model is used to search the industry corpus and the core corpus for the position of the wrong word, and the user dictionary is matched by edit distance and word probability of occurrence, to obtain the list of most likely words. The recommended word list is sorted by weighted probability, based on the probability of each word in the different corpora, and the sorted recommendations are then presented to the user.
In step 4), after the system has checked the text entered by the user for errors, it calculates the text statistics of the user's input and provides update information for the N-gram data in the user dictionary and the corpora. Once the corresponding data tables have been updated, the error checking service is provided with the new corpus data and user dictionary.
The statistical language model used in this method uses a hidden Markov model to find, in the corpora, the words that give the context-dependent word chain at the position of the wrong word the highest probability of occurrence, and takes them as the list of correct words. Each corpus has a different weight; the sorted list of recommended words is obtained by weighting the probability of the word in each corpus by that corpus's weight. Spell checking is completed when the user selects one of the recommended words.
This method is based on the N-gram natural language processing model and uses a core corpus, corpora classified by industry, a user dictionary, and a statistical language model to provide error checking and correct-word recommendation for the text a user enters. After the user enters a piece of text, the server tokenizes it, splitting the text into the set of word chains under the N-gram model, and then calculates the conditional probability in the corpus of the last word of each word chain. The statistical language model calculates the several words with the highest probability as the candidate set of correct words. If the original word is in the candidate set, it is judged to be correct; otherwise the user selects a word from the candidate set as the correct word.
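The segmentation into word chains might look like the sketch below. The regular-expression tokenizer, the function names, and the count tables are assumptions made for illustration; the padding symbol '#' follows the convention the embodiment uses for words at the start of a sentence. The conditional probability shown is the standard maximum-likelihood estimate, which may differ from the exact ratio the patent computes.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def word_chains(tokens, n=3, pad="#"):
    """Return one N-word chain per token, ending in that token."""
    padded = [pad] * (n - 1) + list(tokens)
    return [tuple(padded[i:i + n]) for i in range(len(tokens))]


def conditional_probability(chain, ngram_counts, context_counts):
    """Estimate P(last word | preceding words) as count(chain) / count(context)."""
    denominator = context_counts.get(chain[:-1], 0)
    return ngram_counts.get(chain, 0) / denominator if denominator else 0.0


chains = word_chains(tokenize("We check spelling."))
print(chains)
```

Each chain ends in the word being checked, so the last element of a chain is always the word whose conditional probability is evaluated.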
To address the efficiency and practicality of spell checking systems, the present invention uses weighted calculation over classified corpora, combined with the LD measure and a search algorithm, and checks spelling by first detecting errors and then recommending words. It can quickly and efficiently perform error checking and recommend words that are strongly related to the context. A statistical language model using the Viterbi algorithm is proposed, which quickly calculates the list of words with the largest weighted probability in the corpora for the user's text. A user who receives the recommended word list selects the correct word according to the actual situation and feeds it back to the system. The system adds the word selected by the user and its context statistics to the user-related corpora: the statistical model calculates the updated data for this word in the core corpus, the industry corpus, and the user dictionary and adds it to the data tables, and the next user text to arrive is checked with the new data. The system can thus provide spell checking for text according to the actual usage environment and the particular user.
Beneficial effects: the present invention uses the corpora efficiently and adjusts its data based on actual user feedback, which makes the system practical, fast in checking, and highly synchronized with its data (the corpus data are updated promptly according to usage). By combining several different corpora, it can perform efficient spell checking in a multi-user environment with highly concurrent requests.
Brief description of the drawings
Fig. 1 is a diagram of the N-gram statistical model of the present invention.
Fig. 2 is a structural diagram of the spell checking system of the present invention.
Fig. 3 is a flowchart of a specific embodiment of the present invention.
Fig. 4 is a diagram of the functional modules of spell checking.
Fig. 5 is a table of the Google 1T N-gram data.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and a specific example.
The industry spelling mistake checking method based on user feedback of the present invention mainly solves two problems of current spell checking: the lack of association with the user and the need to search a large corpus quickly. It involves natural language processing, user dictionary design, and database search. The method uses a user dictionary designed by category, checks English text for spelling errors with the N-gram method, and recommends correct words by searching a large corpus database, so that spell checking is associated with the user. The N-gram model (Fig. 1), a basic method of natural language processing, checks for errors in the text using the characteristics of words or sentences and the statistical information in the corpus. The categorized user dictionary, drawing on the current user's historical information together with the corpus statistics, selects the recommended words most relevant to the wrong words in the text the user has entered. The Viterbi algorithm is used to find the word chain in the database with the largest product of conditional probabilities, which improves the computational efficiency of the hidden Markov model on a large corpus and the efficiency with which the statistical information in the database is used. The structure of the whole system and the division of each part into functional modules are shown in Fig. 2 and Fig. 4. The design principles and implementation details of each part are described below.
1. Obtaining and building the corpora and the user dictionary:
The corpora are divided into a core corpus and an industry corpus. As the core statistical data in which language information is stored, they hold the lexical, syntactic, and semantic information of the language as a whole and of industry terminology. During spell checking, the corpora supply the spell checking model with all word and sentence information and provide global data for the statistical language. Meanwhile, user-specific corpus information is obtained from the dictionary the user builds, and the user's historical information is recorded by collecting statistics on the text the user enters. Data tables are defined in a database to store each corpus and the user's input information. The specific table structures are as follows:
(1) User dictionary table structure
(2) Unigram data table structure
(3) Bigram data table structure
(4) Industry corpus table structure
(5) Weight data table structure
2. Building the spell checking model:
The spell checking model is built by using an N-gram model to perform calculations on the statistical information of the corpora and obtain the word chain combination with the maximum weighted conditional probability across the corpora. In building the model, the goals are to judge the correctness of each word accurately and to produce a useful recommendation list, without increasing the complexity of data matching and sorting during computation. Based on the overall statistical model, the user's requirements, and the text information, all corpus data are used to find the weighted probabilities of all possible words. Taking into account the edit distance of each word and its probability in each corpus, an optimal recommendation list for the current word is produced according to the weights. The spell checking model performs error checking and word recommendation on the text entered by the user, as shown in Fig. 3. The model is divided into two stages:
A) Generating the best candidate set for a word
Definition of the probability of occurrence of a word: when N=3, let the two words preceding the checked word word in the text be word1 and word2. The four-tuple (word1, word2, word, COUNT) is retrieved from the corpus, and the probability of word is the ratio of COUNT to the sum of COUNT over the whole corpus. Here word1 and word2 are the two words preceding the current word word in the text; if word is the second word of a sentence, word1='#'; if word is the first word of a sentence, word1=word2='#'; COUNT is the total number of occurrences of this word combination in the corpus.
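The definition above can be illustrated with a small sketch. The in-memory `Counter` stands in for the corpus data table, and the function name `trigram_probability` is hypothetical; only the ratio of COUNT to the sum of all COUNT values comes from the text.

```python
from collections import Counter


def trigram_probability(counts, word1, word2, word):
    """Probability of `word` after (word1, word2): its COUNT over the sum of all COUNTs."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return counts.get((word1, word2, word), 0) / total


# Keys are (word1, word2, word); values play the role of COUNT.
corpus = Counter({
    ("#", "#", "the"): 3,          # "the" as the first word of a sentence
    ("#", "the", "network"): 1,    # "network" as the second word
})
print(trigram_probability(corpus, "#", "#", "the"))
```

A combination that never occurs in the table gets probability 0.0 in this sketch; a real system would probably apply some form of smoothing, which the text does not specify.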
A valid corpus and user dictionary must satisfy the following three conditions:
(1) The user dictionary contains no wrong words: every entry is either a correct word taken from a recognized standard dictionary such as Oxford or Longman, or a special string entered by the user. Here N=1, and the word probability for this part is calculated by combining the COUNT in the two-tuple (word, COUNT) with the edit distance of the word itself;
(2) The core corpus is large enough to ensure that the probability calculations in the model are statistically meaningful. Given the amount of data in the corpus, N-gram entries with COUNT<=200 are not included in the calculation. The core corpus has no bias toward a particular industry or time period, and each time a user uses the system, the statistics of the user's text are added to the core corpus;
(3) The industry corpus is initially divided into several broad categories according to application requirements, and new industries or combinations can be generated as users use the system; that is, the range of industry fields in the industry corpus keeps expanding with use.
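The cut-off in condition (2) amounts to a simple filter over the N-gram table. In the sketch below the function name `prune_ngrams` and the dictionary representation are assumptions; the cut-off value of 200 is the one stated in the text.

```python
def prune_ngrams(counts, cutoff=200):
    """Keep only N-gram entries whose COUNT is strictly greater than `cutoff`."""
    return {ngram: count for ngram, count in counts.items() if count > cutoff}


table = {("of", "the"): 5000, ("the", "routr"): 3, ("in", "a"): 200}
print(prune_ngrams(table))
```

An entry with COUNT exactly equal to 200 is dropped, matching the stated condition COUNT<=200.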
First, the words in the text are matched against the dictionary to judge their correctness. If a word is not in the user dictionary, words in the dictionary that are close to it in edit distance are found to obtain a candidate word set. Next, for each word in the candidate set, together with the preceding N words in the text, the industry corpus and the core corpus are used in turn to calculate its probability.
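Candidate generation by edit distance can be sketched as below, using the standard dynamic-programming form of the Levenshtein distance. The function names and the `max_distance` parameter are illustrative assumptions; the patent does not state what distance bound it uses.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def candidate_words(word, vocabulary, max_distance=2):
    """Return vocabulary words within `max_distance` of `word`, nearest first."""
    scored = sorted((edit_distance(word, entry), entry) for entry in vocabulary)
    return [entry for distance, entry in scored if distance <= max_distance]


print(candidate_words("speling", {"spelling", "spewing", "selling", "apple"}, 1))
```

Scanning the whole vocabulary is linear in its size; with a large dictionary an index over the word list would be needed, which is outside the scope of this sketch.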
The final word weight $W_W$ is calculated from the weights $W_H$, $W_P$, and $W_C$ of the user dictionary, the industry corpus, and the core corpus, where $p_1$, $p_2$, and $p_3$ are the probabilities of occurrence of the word in the user dictionary, the industry corpus, and the core corpus, respectively, and $W_H + W_P + W_C = 1$. The weights are calculated according to how each corpus is called during the user's use of the system; initially $W_H = W_P = W_C$. After the words whose weight is less than the threshold $W_T$ are removed, the candidate word set is obtained.

$$W_W = W_H \cdot p_1 + W_P \cdot p_2 + W_C \cdot p_3$$
Table 1 gives the pseudocode of the word weight calculation algorithm.
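As a rough illustration of the weighting formula (not the patent's own pseudocode), the sketch below computes the weight of each candidate from its three probabilities and removes candidates below the threshold. The function names, the tuple layout, and the example values are assumptions.

```python
def word_weight(p_user, p_industry, p_core, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of a word's probabilities in the three sources.

    `weights` is (user dictionary, industry corpus, core corpus) and must sum to 1;
    the default reflects the equal initial weights described in the text.
    """
    w_user, w_industry, w_core = weights
    if abs(w_user + w_industry + w_core - 1.0) > 1e-9:
        raise ValueError("corpus weights must sum to 1")
    return w_user * p_user + w_industry * p_industry + w_core * p_core


def filter_candidates(probabilities, weights=(1 / 3, 1 / 3, 1 / 3), threshold=0.0):
    """Map each candidate to its weight, dropping those below `threshold`.

    probabilities -- dict: word -> (p_user, p_industry, p_core)
    """
    weighted = {word: word_weight(*probs, weights=weights)
                for word, probs in probabilities.items()}
    return {word: w for word, w in weighted.items() if w >= threshold}


candidates = {"spelling": (0.3, 0.6, 0.3), "spewing": (0.0, 0.0, 0.03)}
print(filter_candidates(candidates, threshold=0.05))
```

With equal weights the first candidate scores about 0.4 and the second about 0.01, so only the first survives a threshold of 0.05.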
B) Recommending correct words
If the original word is present in the candidate word set, it is confirmed to be a correct word. Otherwise, the weighted probability of each candidate word is calculated from the corpus weights and the joint probability of the word chain, the candidate word set is sorted by weighted probability, and the sorted words are sent to the user as the recommendation list.
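The recommendation stage can be sketched as a sort over the weighted candidates. The function name `recommend`, the `top_k` limit, and the tie-breaking by alphabetical order are assumptions; the text only says that the sorted words form the recommendation list.

```python
def recommend(original, weighted_candidates, top_k=5):
    """Return the recommendation list for `original`.

    weighted_candidates -- dict: candidate word -> weighted probability
    An empty list means the original word was found among the candidates
    and is therefore treated as correct.
    """
    if original in weighted_candidates:
        return []
    ranked = sorted(weighted_candidates,
                    key=lambda word: (-weighted_candidates[word], word))
    return ranked[:top_k]


scores = {"spelling": 0.4, "spewing": 0.1, "selling": 0.2}
print(recommend("speling", scores, top_k=2))
```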
3. Updating the user-related text statistics, dictionary, and corpora
After the user receives the word recommendations and selects the correct word, the user's input text also becomes part of the corpus. The system collects statistics on the text as amended by the user and enters the N-gram information of the corrected text into the user dictionary, the core corpus, and the corresponding industry corpus, adding the specific occurrence counts and context data to these data tables. In addition, according to the corpus in which the correct word chosen by the user is found, the weights $W_H$, $W_P$, and $W_C$ that apply when this user calls each corpus are recalculated.
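One possible way to realize this feedback step is sketched below. The class `FeedbackState`, the hit counters, and the rule that each weight is a source's share of the recorded selections are all assumptions: the patent states that the weights are recalculated from how the corpora are used but does not give the formula.

```python
from collections import Counter


class FeedbackState:
    """Per-user N-gram counts and corpus weights, updated from user feedback."""

    def __init__(self):
        # Starting every counter at 1 gives equal initial weights.
        self.hits = {"user": 1, "industry": 1, "core": 1}
        self.ngrams = Counter()

    def weights(self):
        """Return each source's share of recorded selections; the shares sum to 1."""
        total = sum(self.hits.values())
        return {source: count / total for source, count in self.hits.items()}

    def record(self, tokens, chosen_source, n=3, pad="#"):
        """Add the corrected text's N-grams and credit the source of the chosen word."""
        padded = [pad] * (n - 1) + list(tokens)
        for i in range(len(tokens)):
            self.ngrams[tuple(padded[i:i + n])] += 1
        self.hits[chosen_source] += 1


state = FeedbackState()
state.record(["check", "spelling"], chosen_source="industry")
print(state.weights())
```

After one selection from the industry corpus, that source's weight rises from one third to one half, while the other two fall to one quarter each.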
The present invention may also have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art can make various corresponding changes and modifications according to the present invention, and all such changes and modifications shall fall within the protection scope of the appended claims of the present invention.
Claims (9)
1. An industry spelling mistake checking method based on user feedback, characterized by comprising the steps of:
Step 1, obtaining and building the corpora and the user dictionary:
The corpora are divided into a user dictionary, a core corpus, and an industry corpus. As the core statistical data in which language information is stored, they hold the lexical, syntactic, and semantic information of the whole language. During spell checking, the corpora supply the spell checking model with all word and sentence information and provide global data for the whole language. Meanwhile, new corpus information about the user is obtained from the text the user enters and from the user's usage, and the corpora and the user dictionary are updated;
Data tables are defined in a database to store the global corpus and the user's input information;
Step 2, building the spell checking model:
The spell checking model is built by using an N-gram model to perform calculations on the statistical information of the corpora and obtain the word chain combination with the maximum conditional probability;
Step 3, the system's interactive interface processes the text entered by the user using the error checking and word recommendation of the spell checking model;
Step 4, updating the user-related text statistics, dictionary, and corpora: statistics are collected on the user's input and the selected correct words, and the word information and context statistics of the corrected text are entered into the user dictionary, the core corpus, and the corresponding industry corpus.
2. The industry spelling mistake checking method based on user feedback according to claim 1, characterized in that, in step 1, the necessary conditions for a valid corpus and user dictionary include:
(1) The dictionary contains no wrong words: its entries are correct words taken from a recognized standard dictionary such as Oxford or Longman, together with industry or special words defined by the user;
(2) The core corpus is large enough, has no bias toward a particular industry or time period, and must contain N-gram information, which provides basic word chain statistics;
(3) The industry corpus is initially divided according to requirements and then develops naturally according to users' choices; a given user can be a user of several industry corpora;
(4) The user dictionary is a dictionary built according to the user's input needs and can be managed by the user.
3. The industry spelling mistake checking method based on user feedback according to claim 1, characterized in that step 2 specifically comprises:
Step 2.1, judging the correctness of a word: each word in the text is matched against the standard dictionary; if the word is not in the standard dictionary, the industry corpus and then the user dictionary are checked in turn; if the word is in none of these three kinds of data tables, it is judged to be a wrong word and the next step is carried out;
Step 2.2, recommending correct words: based on edit distance and word chain joint probability, the correct words most relevant to the wrong word are calculated by weighting across the corpora, and the several words with the largest combined probability are selected to form the recommendation list for the wrong word.
4. the industry misspelling inspection method based on user feedback according to claim 3, it is characterized in that, in described step 2.1, use the viterbi algorithm current word of Rapid matching probability of occurrence in each corpus in N-gram model, and obtain current word and front N-1 the joint probability that word occurs, realize the judgement to current word correctness.
5. the industry misspelling inspection method based on user feedback according to claim 3, is characterized in that, in described step 2.2, by editing distance and word probability of occurrence, recommendation word list is sorted, and so rear line provides recommendation results; The weights of word list of being used for sorting are that the probability in each corpus is weighted acquisition to word.
6. the industry misspelling inspection method based on user feedback according to claim 1, it is characterized in that, in described step 4, text to user's input carries out after bug check, calculate the text statistical information in user's input, for the N-gram data in user dictionary and corpus provide lastest imformation, after corresponding tables of data is upgraded, provide bug check service by new corpus data and dictionary.
7. The industry misspelling inspection method based on user feedback according to claim 1, characterized in that a Hidden Markov Model (HMM) is used to find, in the corpora, the words for which the context-dependent word chain at the position of the wrong word has the highest probability of occurrence, and these words form the correct word list; each corpus has a different weight, and the sorted recommended word list is obtained by weighting the probability of each word in a corpus by the weight of that corpus; in the misspelling inspection, the selection of the recommended word is made by the user.
8. The industry misspelling inspection method based on user feedback according to claim 1, characterized in that a user dictionary, a core corpus, an industry corpus and a statistical language model are adopted; after the user inputs a piece of text, the server segments the text, that is, splits it into a set of word chains under the N-gram grammar, and thereby calculates the conditional probability in the corpora of the last word in each word chain; the statistical language model computes the several words with the highest probability as the candidate set of correct words; if the original word is in the candidate set, the original word is judged to be correct; otherwise the user selects a word from the candidate set as the correct word.
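The flow in this claim can be sketched in Python as three steps: split the text into word chains, estimate the conditional probability of the last word in a chain, and build a candidate set. The sketch assumes a single count table and a plain count ratio as the estimate; weighting across several corpora is omitted for brevity, and all names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Chain = Tuple[str, ...]


def word_chains(words: List[str], n: int) -> List[Chain]:
    """Split a word sequence into overlapping chains of n words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def conditional_probability(chain: Chain, counts: Counter) -> float:
    """Estimate P(last word | preceding words) as a ratio of raw counts."""
    denom = counts[chain[:-1]]
    return counts[chain] / denom if denom else 0.0


def candidate_set(history: Chain, vocabulary: List[str],
                  counts: Counter, k: int = 3) -> List[str]:
    """Return up to k words with the highest non-zero probability after history."""
    scored = [(conditional_probability(history + (w,), counts), w)
              for w in vocabulary]
    scored = [(p, w) for p, w in scored if p > 0.0]
    scored.sort(key=lambda pw: (-pw[0], pw[1]))  # ties broken alphabetically
    return [w for _, w in scored[:k]]


def check_last_word(chain: Chain, vocabulary: List[str],
                    counts: Counter, k: int = 3) -> Tuple[bool, List[str]]:
    """Return (is_correct, candidates) for the last word of one chain."""
    candidates = candidate_set(chain[:-1], vocabulary, counts, k)
    return chain[-1] in candidates, candidates
```

When `check_last_word` reports the word as not correct, the returned candidates are what would be offered to the user for selection.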
9. The industry misspelling inspection method based on user feedback according to claim 8, characterized in that the statistical language model uses the Viterbi algorithm to calculate, for the user's text, the list of words with the highest weighted probability in the corpora; the user who obtains the recommended word list selects the correct word according to the actual situation, and the word and its context statistics are added to the corpus associated with that user; the statistical language model calculates the updated data for this word in the user dictionary, the core corpus and the industry corpus and adds it to the data tables, and the new data is used to perform misspelling inspection on the next user text to arrive.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410149427.8A CN103885938B (en) | 2014-04-14 | 2014-04-14 | Industry spelling mistake checking method based on user feedback |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103885938A (en) | 2014-06-25 |
CN103885938B (en) | 2015-04-22 |
Family
ID=50954833
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410149427.8A Expired - Fee Related CN103885938B (en) | 2014-04-14 | 2014-04-14 | Industry spelling mistake checking method based on user feedback |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103885938B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060015326A1 (en) * | 2004-07-14 | 2006-01-19 | International Business Machines Corporation | Word boundary probability estimating, probabilistic language model building, kana-kanji converting, and unknown word model building |
CN102298577A (en) * | 2011-09-21 | 2011-12-28 | 深圳市万兴软件有限公司 | Method and device for detecting spelling of document edition |
CN102937949A (en) * | 2012-10-15 | 2013-02-20 | 福建榕基软件股份有限公司 | Method and system for checking English spelling in rich text editor |
Cited By (53)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104112447A (en) * | 2014-07-28 | 2014-10-22 | 科大讯飞股份有限公司 | Method and system for improving statistical language model accuracy |
CN104112447B (en) * | 2014-07-28 | 2017-08-25 | 安徽普济信息科技有限公司 | Method and system for improving accuracy of statistical language model |
US10402435B2 (en) | 2015-06-30 | 2019-09-03 | Microsoft Technology Licensing, Llc | Utilizing semantic hierarchies to process free-form text |
CN105206267A (en) * | 2015-09-09 | 2015-12-30 | 中国科学院计算技术研究所 | Voice recognition error correction method with integration of uncertain feedback and system thereof |
CN105206267B (en) * | 2015-09-09 | 2019-04-02 | 中国科学院计算技术研究所 | A kind of the speech recognition errors modification method and system of fusion uncertainty feedback |
CN106708893A (en) * | 2015-11-17 | 2017-05-24 | 华为技术有限公司 | Error correction method and device for search query term |
WO2017084506A1 (en) * | 2015-11-17 | 2017-05-26 | 华为技术有限公司 | Method and device for correcting search query term |
CN106708893B (en) * | 2015-11-17 | 2018-09-28 | 华为技术有限公司 | Search query word error correction method and device |
CN105654955A (en) * | 2016-03-18 | 2016-06-08 | 华为技术有限公司 | Voice recognition method and device |
CN105654955B (en) * | 2016-03-18 | 2019-11-12 | 华为技术有限公司 | Audio recognition method and device |
CN107291730B (en) * | 2016-03-31 | 2020-07-31 | 阿里巴巴集团控股有限公司 | Method and device for providing correction suggestion for query word and probability dictionary construction method |
CN107291730A (en) * | 2016-03-31 | 2017-10-24 | 阿里巴巴集团控股有限公司 | Method, device and the probabilistic dictionaries construction method of correction suggestion are provided query word |
CN107291775A (en) * | 2016-04-11 | 2017-10-24 | 北京京东尚科信息技术有限公司 | The reparation language material generation method and device of error sample |
CN107291775B (en) * | 2016-04-11 | 2020-07-31 | 北京京东尚科信息技术有限公司 | Method and device for generating repairing linguistic data of error sample |
CN107305542A (en) * | 2016-04-21 | 2017-10-31 | 珠海金山办公软件有限公司 | A kind of spell checking methods and device |
CN107305542B (en) * | 2016-04-21 | 2018-11-16 | 珠海金山办公软件有限公司 | A kind of spell checking methods and device |
CN106294325A (en) * | 2016-08-11 | 2017-01-04 | 海信集团有限公司 | The optimization method and device of spatial term statement |
CN106294325B (en) * | 2016-08-11 | 2019-01-04 | 海信集团有限公司 | The optimization method and device of spatial term sentence |
CN106528616B (en) * | 2016-09-30 | 2019-12-17 | 厦门快商通科技股份有限公司 | Language error correction method and system in human-computer interaction process |
CN106528616A (en) * | 2016-09-30 | 2017-03-22 | 厦门快商通科技股份有限公司 | Language error correcting method and system for use in human-computer interaction process |
WO2018103128A1 (en) * | 2016-12-09 | 2018-06-14 | Hong Kong Applied Science and Technology Research Institute Company Limited | System and method for organizing and processing feature based data structures |
US10127219B2 (en) | 2016-12-09 | 2018-11-13 | Hong Kong Applied Science and Technoloy Research Institute Company Limited | System and method for organizing and processing feature based data structures |
CN110073349B (en) * | 2016-12-15 | 2023-10-10 | 微软技术许可有限责任公司 | Word order suggestion considering frequency and formatting information |
CN110073349A (en) * | 2016-12-15 | 2019-07-30 | 微软技术许可有限责任公司 | Consider the word order suggestion of frequency and formatted message |
US10679008B2 (en) | 2016-12-16 | 2020-06-09 | Microsoft Technology Licensing, Llc | Knowledge base for analysis of text |
CN107122346B (en) * | 2016-12-28 | 2018-02-27 | 平安科技(深圳)有限公司 | The error correction method and device of a kind of read statement |
CN107122346A (en) * | 2016-12-28 | 2017-09-01 | 平安科技(深圳)有限公司 | The error correction method and device of a kind of read statement |
US11314921B2 (en) | 2017-06-05 | 2022-04-26 | Baidu Online Network Technology (Beijing) Co., Ltd. | Text error correction method and apparatus based on recurrent neural network of artificial intelligence |
CN107357775A (en) * | 2017-06-05 | 2017-11-17 | 百度在线网络技术(北京)有限公司 | The text error correction method and device of Recognition with Recurrent Neural Network based on artificial intelligence |
CN108628827A (en) * | 2018-04-11 | 2018-10-09 | 广州视源电子科技股份有限公司 | Candidate word evaluation method and device, computer equipment and storage medium |
CN109033065A (en) * | 2018-06-01 | 2018-12-18 | 昆明理工大学 | A kind of English- word spelling inspection method |
CN110600011A (en) * | 2018-06-12 | 2019-12-20 | 中国移动通信有限公司研究院 | Voice recognition method and device and computer readable storage medium |
CN110600011B (en) * | 2018-06-12 | 2022-04-01 | 中国移动通信有限公司研究院 | Voice recognition method and device and computer readable storage medium |
CN109145287A (en) * | 2018-07-05 | 2019-01-04 | 广东外语外贸大学 | Indonesian word error-detection error-correction method and system |
CN109542247A (en) * | 2018-11-14 | 2019-03-29 | 腾讯科技(深圳)有限公司 | Clause recommended method and device, electronic equipment, storage medium |
CN109542247B (en) * | 2018-11-14 | 2023-03-24 | 腾讯科技(深圳)有限公司 | Sentence recommendation method and device, electronic equipment and storage medium |
CN111259654A (en) * | 2018-11-30 | 2020-06-09 | 北京嘀嘀无限科技发展有限公司 | Text error detection method and device |
CN111259654B (en) * | 2018-11-30 | 2023-09-15 | 北京嘀嘀无限科技发展有限公司 | Text error detection method and device |
CN110020432A (en) * | 2019-03-29 | 2019-07-16 | 联想(北京)有限公司 | A kind of information processing method and information processing equipment |
CN110147546B (en) * | 2019-04-03 | 2023-05-26 | 苏州驰声信息科技有限公司 | Grammar correction method and device for spoken English |
CN110147546A (en) * | 2019-04-03 | 2019-08-20 | 苏州驰声信息科技有限公司 | A kind of syntactic correction method and device of Oral English Practice |
CN112328737B (en) * | 2019-07-17 | 2023-05-05 | 北方工业大学 | Spelling data generation method |
CN112328737A (en) * | 2019-07-17 | 2021-02-05 | 北方工业大学 | Spelling data generation method |
CN110489723A (en) * | 2019-08-19 | 2019-11-22 | 绍兴数纺科技有限公司 | A kind of data error detection and error correction system of dyeing information system |
CN110532572A (en) * | 2019-09-12 | 2019-12-03 | 四川长虹电器股份有限公司 | Spell checking methods based on the tree-like naive Bayesian of TAN |
CN113095072A (en) * | 2019-12-23 | 2021-07-09 | 华为技术有限公司 | Text processing method and device |
CN111523532A (en) * | 2020-04-14 | 2020-08-11 | 广东小天才科技有限公司 | Method for correcting OCR character recognition error and terminal equipment |
CN113743092A (en) * | 2020-05-27 | 2021-12-03 | 阿里巴巴集团控股有限公司 | Text processing method and device, electronic equipment and computer readable storage medium |
CN111859920A (en) * | 2020-06-19 | 2020-10-30 | 北京国音红杉树教育科技有限公司 | Method and system for identifying word spelling errors and electronic equipment |
CN111859920B (en) * | 2020-06-19 | 2024-06-04 | 北京国音红杉树教育科技有限公司 | Word misspelling recognition method, system and electronic equipment |
CN111737980B (en) * | 2020-06-22 | 2023-05-16 | 桂林电子科技大学 | Correction method for use errors of English text words |
CN111737980A (en) * | 2020-06-22 | 2020-10-02 | 桂林电子科技大学 | Method for correcting English text word use errors |
CN118152428A (en) * | 2024-05-09 | 2024-06-07 | 烟台海颐软件股份有限公司 | Prediction and enhancement method and device for query instruction of electric power customer service system |
Also Published As
Publication number | Publication date |
---|---|
CN103885938B (en) | 2015-04-22 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN103885938B (en) | Industry spelling mistake checking method based on user feedback | |
US10997370B2 (en) | Hybrid classifier for assigning natural language processing (NLP) inputs to domains in real-time | |
US9218390B2 (en) | Query parser derivation computing device and method for making a query parser for parsing unstructured search queries | |
Mehmood et al. | An unsupervised lexical normalization for Roman Hindi and Urdu sentiment analysis | |
CN106844331A (en) | Sentence similarity calculation method and system | |
CN103314369B (en) | Machine translation apparatus and method | |
CN110348003A (en) | Method and device for extracting effective text information | |
CN117251455A (en) | Intelligent report generation method and system based on large model | |
Lee | Natural Language Processing: A Textbook with Python Implementation | |
CN104750676A (en) | Machine translation processing method and device | |
Ganji et al. | Novel textual features for language modeling of intra-sentential code-switching data | |
Ma et al. | Improving Chinese spell checking with bidirectional LSTMs and confusionset-based decision network | |
CN110750967B (en) | Pronunciation labeling method and device, computer equipment and storage medium | |
Rosner et al. | A tagging algorithm for mixed language identification in a noisy domain. | |
Melero et al. | Holaaa!! writin like u talk is kewl but kinda hard 4 NLP | |
Sharma et al. | Contextual multilingual spellchecker for user queries | |
Wu | A computational neural network model for college English grammar correction | |
CN110807096A (en) | Information pair matching method and system on small sample set | |
Long | A Grammatical Error Correction Model for English Essay Words in Colleges Using Natural Language Processing | |
Chen et al. | A topic detection method based on Semantic Dependency Distance and PLSA | |
CN104641367A (en) | Formatting module, system and method for formatting an electronic character sequence | |
Sreeram et al. | A Novel Approach for Effective Recognition of the Code-Switched Data on Monolingual Language Model. | |
Sreeram et al. | Exploiting Parts-of-Speech for improved textual modeling of code-switching data | |
CN114970541A (en) | Text semantic understanding method, device, equipment and storage medium | |
Wang | Research on cultural translation based on neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20150422 |