CN111178061B - Multi-lingual word segmentation method based on code conversion - Google Patents

Info

Publication number
CN111178061B
Authority
CN
China
Prior art keywords
language
data
code conversion
characters
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911324149.4A
Other languages
Chinese (zh)
Other versions
CN111178061A (en
Inventor
杜权
徐萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Yayi Network Technology Co ltd
Original Assignee
Shenyang Yayi Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Yayi Network Technology Co ltd
Priority to CN201911324149.4A
Publication of CN111178061A
Application granted
Publication of CN111178061B
Legal status: Active
Anticipated expiration

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a multilingual word segmentation method based on code conversion, comprising the following steps: 1) Data preprocessing: inputting the data to be segmented and a language tag, filtering redundant spaces in the data, and converting the data to UTF-8 encoding; 2) Loading a code conversion file: loading the code conversion resource file of the corresponding language according to the language tag input in step 1); 3) Code conversion: performing code conversion on the data using the code conversion resource file loaded in step 2); 4) Word segmentation: segmenting the code-converted data using symbols such as punctuation marks and spaces. The method accommodates the different encoding characteristics of multiple languages, performs targeted analysis and code conversion according to the characteristics of each language, and allows a single word segmentation method to segment multiple languages.

Description

Multi-lingual word segmentation method based on code conversion
Technical Field
The invention relates to word segmentation in language processing, and in particular to a multilingual word segmentation method based on code conversion.
Background
Language is the medium through which people exchange ideas and the most important tool of human communication. It arises and develops together with human society, and it inevitably influences politics, the economy, science and technology, and culture. Currently, 5651 languages are known in the world, distributed across different regions.
According to the shared features and historical relationships of the phonology, grammar, and vocabulary of each language, linguists divide the world's languages into a number of language families, each comprising several languages. Language families and languages are distributed in particular regions, and many cultural features are closely tied to them.
The word is the smallest unit of a language that can be used independently, and machine translation systems generally analyze text with the word as the basic unit, so an effective, high-quality word segmentation module is crucial to a machine translation system.
Every language has its own characteristics. From the point of view of word segmentation, languages can be roughly divided into two types. One type comprises isolating or agglutinative languages such as Chinese and Japanese. The other comprises most Western languages, of which English is representative; these are referred to as inflected languages, and their words are delimited by spaces. The spaces in a text of such a language mark word boundaries, so a complete sentence can be split into a sequence of words by using the space as the segmentation mark. Most Western languages are therefore segmented on spaces.
In the 19th century, European scholars studied nearly one hundred of the world's languages and found correspondences and similarities among the sounds, vocabulary, and grammar rules of some of them. They grouped such languages into one class, a language branch; because correspondences also exist between different branches, these were in turn grouped as languages of common origin, which is the genealogical relationship of languages. In the 20th century, linguists further classified the world's languages into language families, such as the Indo-European and Sino-Tibetan families. Although languages are divided into families that each have their own characteristics, languages within the same family still differ in many ways, and some languages have several encodings and written forms, for example:
1) Vietnamese has two encodings for the same letter: one is a single independent character, and the other is a combination of two characters.
2) Arabic script has several representations in Unicode, such as Arabic, Arabic Presentation Forms-A, and Arabic Presentation Forms-B, and Persian data contains both ordinary Arabic characters and Arabic Presentation Forms-B characters.
3) Bulgarian belongs to the South Slavic branch of the Indo-European language family and is written in Cyrillic letters, but Bulgarian data is often mixed with a large number of Latin letters that need to be converted.
As these cases show, no single word segmentation method can satisfy the features of all languages at once, and it is difficult to segment all languages in the same way. At the same time, there are so many languages that designing a dedicated segmentation method for each one is cumbersome and impractical. Different languages therefore need to be studied and analyzed, their data preprocessed by code conversion targeted at the features of each language, and the converted data then segmented in a uniform way.
Unicode is an encoding scheme created to overcome the limitations of traditional character set encodings. It assigns a unique binary code to every character of every language, so as to meet the needs of cross-language and cross-platform text conversion and processing. Because Unicode defines a single character set, it avoids the ambiguity of double-byte character sets, and it is now widely used for information exchange worldwide. In Unicode, each character block, such as Greek, Cyrillic, or Armenian, has its own encoding range under the same standard, and each character has its own code within that range. FIG. 1 shows the Unicode encoding ranges of some languages.
Each of the three languages mentioned above has its own encoding range, but their data is occasionally mixed with noise: in some languages one character is split into a combination of two characters, as in Vietnamese; in some, the data contains characters from two different encoding ranges of the same script, as in Persian; and in some, the data contains characters from the encoding range of another script, as in Bulgarian.
The multilingual word segmentation method based on code conversion unifies the different encodings of the same language and consolidates the different representations and written forms of the same language. This reduces the size of the training vocabulary, alleviates the sparsity of the training data, improves the quality of word segmentation in machine translation, and thereby improves the quality of the translated text.
At present, no multilingual word segmentation method based on code conversion that meets these requirements has been reported.
Disclosure of Invention
In view of the shortcomings of the prior art, in which multilingual word segmentation mainly splits on spaces and punctuation, cannot segment multilingual data that uses several encodings, and cannot produce high-quality segmentation results, the invention provides a word segmentation approach based on code conversion that can convert between the multiple encoding ranges used by a language.
To solve this technical problem, the invention adopts the following technical scheme:
The invention relates to a multilingual word segmentation method based on code conversion, comprising the following steps:
1) Data preprocessing: inputting the data to be segmented and a language tag, filtering redundant spaces in the data, and converting the data to UTF-8 encoding;
2) Loading a code conversion file: loading the code conversion resource file of the corresponding language according to the language tag input in step 1);
3) Code conversion: performing code conversion on the data using the code conversion resource file loaded in step 2);
4) Word segmentation: segmenting the code-converted data using symbols such as punctuation marks and spaces.
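Step 4) can be illustrated with a minimal sketch. It assumes that "segmenting using punctuation marks and spaces" means splitting on whitespace and emitting each punctuation mark as its own token; the function name `segment` and the use of Unicode general categories are illustrative assumptions, not details given in the patent.

```python
import unicodedata
from typing import List


def segment(sentence: str) -> List[str]:
    """Split on whitespace and separate punctuation marks as single tokens."""
    tokens: List[str] = []
    current: List[str] = []
    for ch in sentence:
        if ch.isspace():
            # Whitespace ends the current word and is discarded.
            if current:
                tokens.append("".join(current))
                current = []
        elif unicodedata.category(ch).startswith("P"):
            # Punctuation ends the current word and becomes its own token.
            if current:
                tokens.append("".join(current))
                current = []
            tokens.append(ch)
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens
```

Because combining marks are neither whitespace nor punctuation, a letter followed by a combining diacritic stays in one token here; the code conversion in step 3) is still what makes the two written forms of such a letter compare equal.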
In step 2), the code conversion resource file is prepared by analyzing the particular characteristics of each language and distinguishing its written forms and usage habits; the corresponding code conversion file is then loaded according to the encoding ranges and conversion requirements of each language.
For Vietnamese, tone-marked characters in the data have two written forms, a single character and a combined character. Before word segmentation, the Vietnamese data is code-converted so that characters in the non-standard encoding are uniformly converted into the corresponding standard characters, and a resource file covering both written forms is loaded according to the correspondence between the standard and non-standard encodings of the same character.
For Persian, most characters in the data are ordinary Arabic characters, while a few words are encoded with Arabic Presentation Forms-B characters. The conversion rules between Arabic characters and Arabic Presentation Forms-B characters are loaded, the Presentation Forms-B characters in the Persian data are converted into ordinary Arabic characters, and the subsequent word segmentation and machine translation training are then carried out, so that Persian text from several encoding ranges can be translated together.
For Bulgarian, the language is written in Cyrillic letters and is case-sensitive. Some of its letters look like Latin letters but have different codes, and Bulgarian data is mixed with a large number of Latin letters used in place of the Cyrillic letters. A resource file containing the Cyrillic letters of Bulgarian, the confusable Latin letters, and the correspondence between them is loaded, and the Bulgarian data is converted according to the resource file.
In step 3), the code conversion file of each language is loaded and the data is code-converted, specifically:
301) Input the language data to be processed and the language tag;
302) Read the resource file corresponding to the language and load it into memory;
303) Traverse each character of the language data sentence by sentence and determine whether the current character needs code conversion;
304) If code conversion is needed, convert the character according to the conversion rules of the language and the corresponding resource file;
305) Output the code-converted sentence to step 4).
In step 303), if no code conversion is needed, continue with the next character until all characters in the data have been traversed, then go to step 305).
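Steps 301) to 305) can be sketched as a table lookup per character. This is a minimal illustration under assumptions: the in-memory `RESOURCES` table stands in for the per-language resource files, and its single Bulgarian entry is a hypothetical example, not the content of an actual resource file.

```python
from typing import Dict

# Hypothetical in-memory stand-in for the per-language resource files,
# keyed by language tag. Real resource files would hold many more entries.
RESOURCES: Dict[str, Dict[str, str]] = {
    "bg": {"B": "\u0412"},  # Latin B -> Cyrillic VE
}


def transcode(sentence: str, table: Dict[str, str]) -> str:
    """Steps 303-304: convert each character that has an entry in the table."""
    converted = []
    for ch in sentence:
        # Characters without an entry need no conversion and pass through.
        converted.append(table.get(ch, ch))
    return "".join(converted)


def convert(data: str, tag: str) -> str:
    """Steps 301-305: pick the table for the tag, convert, return the sentence."""
    table = RESOURCES.get(tag, {})  # step 302: load the resource for this tag
    return transcode(data, table)   # step 305: hand the result to segmentation
```

A language with no resource table is returned unchanged, which matches the case in step 303) where no character needs conversion.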
The invention has the following beneficial effects and advantages:
1. The multilingual word segmentation method based on code conversion accommodates the different encoding characteristics of multiple languages, performs targeted analysis and code conversion according to the characteristics of each language, and allows a single word segmentation method to segment multiple languages.
2. The method analyzes the characteristics of different languages and performs code conversion on multilingual data. It resolves the problem of one language having several encodings in its data and can also filter out wrongly encoded data, thereby improving the quality of the multilingual data.
3. The code conversion method reduces data sparsity and improves data quality, and it also reduces the vocabulary size in subsequent machine translation training, thereby improving translation quality.
Drawings
FIG. 1 is a table of the Unicode encoding ranges of some languages;
FIG. 2 is a table of some of the language tags used by the method of the invention;
FIG. 3 is a general flowchart of the multilingual word segmentation method based on code conversion according to the invention.
Detailed Description
The invention is further elucidated with reference to the accompanying drawings.
The invention normalizes data in which one language is represented with several encodings by performing Unicode code conversion on it, which makes the data easier for a word segmentation system to process. Each script in Unicode has an encoding range of its own, which avoids character set ambiguity; the encoding ranges of some languages are shown in FIG. 1.
FIG. 3 is the general flowchart of the multilingual word segmentation method based on code conversion. The method specifically comprises the following steps:
1) Data preprocessing: inputting the data to be segmented and a language tag, filtering redundant spaces in the data, and converting the data to UTF-8 encoding;
2) Loading a code conversion file: loading the code conversion resource file of the corresponding language according to the language tag input in step 1);
3) Code conversion: performing code conversion on the data using the code conversion resource file loaded in step 2);
4) Word segmentation: segmenting the code-converted data using symbols such as punctuation marks and spaces.
In step 1), each language has its own language tag, which consists of 2 to 3 English letters and identifies the language. The language tag is input into the word segmentation system so that the system can identify the language and carry out the subsequent code conversion; some of the language tags are shown in FIG. 2.
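The preprocessing in step 1) can be sketched as follows. The sketch assumes that the caller knows the source encoding of the raw bytes and that "redundant spaces" means runs of whitespace and leading or trailing whitespace; the function name `preprocess` and its parameters are illustrative assumptions.

```python
import re
from typing import Tuple


def preprocess(raw: bytes, tag: str, source_encoding: str = "utf-8") -> Tuple[bytes, str]:
    """Validate the language tag, collapse redundant spaces, re-encode as UTF-8."""
    # The text describes tags as 2 to 3 English letters (for example vi, fa, bg).
    if not re.fullmatch(r"[A-Za-z]{2,3}", tag):
        raise ValueError("language tag must be 2 to 3 English letters: %r" % tag)
    text = raw.decode(source_encoding)
    # Collapse every run of whitespace to one space and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text.encode("utf-8"), tag.lower()
```

For example, `preprocess("  a   b \n".encode("utf-16"), "VI", "utf-16")` yields the UTF-8 bytes `b"a b"` together with the tag `"vi"`.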
In step 2), the code conversion resource file is prepared by analyzing the particular characteristics of each language and distinguishing its written forms and usage habits; the corresponding code conversion file is then loaded according to the encoding ranges and conversion requirements of each language.
Taking Vietnamese, Persian, and Bulgarian as examples, the three languages are characterized as follows:
Step 201) Vietnamese is a tonal language that uses tones to distinguish word meanings and is written with Latin letters. Its letters are divided into tone-marked letters and letters without tone marks. The standard form of a tone-marked letter is a single independent character, but the same letter can also be written by combining an unmarked character with a diacritic, so tone-marked characters in Vietnamese data have two written forms.
Take one tone-marked letter as an example [character image]. It is an independent character that belongs to the Latin Extended Additional block of Unicode, and its standard Unicode code point is 1EAC. Vietnamese data also contains another written form of the same letter [character image], which is composed of a base character [character image] and a combining diacritic and therefore corresponds to two characters in Unicode. The two written forms are indistinguishable to a reader of Vietnamese, but when a word segmentation system segments the data, the non-standard form may be split into two characters. This changes the content of the original data, destroys its meaning, and lowers data quality. Vietnamese data therefore needs to be code-converted before word segmentation, so that characters in the non-standard encoding are uniformly converted into the corresponding standard characters.
Vietnamese has 195 letters, most of which are tone-marked characters that can also be written as a combination of an ordinary character and a diacritic. Vietnamese therefore requires a resource file covering both written forms, built from the correspondence between the standard and non-standard encodings of the same character.
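The two written forms can be made concrete with a short check. The sketch assumes that the letter in question is U+1EAC and that its combined form is U+00C2 followed by U+0323 (COMBINING DOT BELOW); it uses Unicode NFC normalization from Python's standard `unicodedata` module as a stand-in for the patent's resource file, which is a substituted technique rather than the method the patent describes.

```python
import unicodedata

PRECOMPOSED = "\u1EAC"     # single code point (standard form)
COMBINED = "\u00C2\u0323"  # base letter + COMBINING DOT BELOW (two code points)


def same_after_nfc(a: str, b: str) -> bool:
    """True if both strings are identical after NFC composition."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The two forms render alike but differ as code point sequences.
forms_differ = PRECOMPOSED != COMBINED
lengths = (len(PRECOMPOSED), len(COMBINED))  # (1, 2)
unified = same_after_nfc(PRECOMPOSED, COMBINED)
```

A vocabulary built from unnormalized data would count the two forms as different strings, which is the sparsity problem the conversion is meant to remove.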
Step 202) The Persian alphabet consists of 28 Arabic letters and 4 newly created Persian letters, and the language contains a large number of Arabic loanwords. Unicode has three blocks containing Arabic letters: Arabic, Arabic Presentation Forms-A, and Arabic Presentation Forms-B. The Arabic letters occupy the range 0x0600 to 0x06FF. Because of the writing conventions of Arabic script, the same letter takes different shapes at different positions in a word, so Unicode also defines the two blocks Arabic Presentation Forms-A and Arabic Presentation Forms-B to encode these other representations of the Arabic letters. Arabic Presentation Forms-B defines the positional variants and display characters used in Persian, and its range is 0xFE70 to 0xFEFF.
In existing Persian data, most characters are ordinary Arabic characters, and a few words are composed of Arabic Presentation Forms-B characters. Because the two kinds of Persian data come from different encoding ranges, training a machine translation system on both leads to data sparsity and an oversized vocabulary, which harms translation quality. The Persian data in the Presentation Forms-B range is also scarce, and its small share of the training data affects the performance of the machine translation system. For the sake of communication among peoples and languages, and to protect human linguistic culture, Persian text encoded in Arabic Presentation Forms-B cannot simply be discarded. It is therefore necessary to load the conversion rules between Arabic characters and Arabic Presentation Forms-B characters for Persian, convert the Presentation Forms-B characters in the Persian data into ordinary Arabic characters, and then carry out word segmentation and machine translation training, so that Persian text from several encoding ranges can be translated together.
Step 203) Bulgarian has developed through three stages: Old Bulgarian, Middle Bulgarian, and Modern Bulgarian. Modern Bulgarian is written with 30 Cyrillic letters and is case-sensitive, and some of its letters look like Latin letters without being the same characters.
For example, the Bulgarian letter "В" and the Latin letter "B" look identical, but the Unicode code of the Bulgarian letter "В" is 0x412 and that of the Latin letter "B" is 0x42. They are two completely different letters, and Bulgarian has many such letters.
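The difference between the two look-alike letters can be checked directly. This is a small illustration using Python's standard `unicodedata` module; the helper name `describe` is an assumption made for the example.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


cyrillic_ve = "\u0412"  # the Bulgarian (Cyrillic) letter that looks like B
latin_b = "\u0042"      # the Latin letter B

cyrillic_info = describe(cyrillic_ve)  # "U+0412 CYRILLIC CAPITAL LETTER VE"
latin_info = describe(latin_b)         # "U+0042 LATIN CAPITAL LETTER B"
```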
Machine translation training data comes from a wide range of sources, and the large volumes of Bulgarian data crawled from the internet are mixed with many Latin letters used in place of the Cyrillic letters of Bulgarian. To retain genuine Bulgarian data and reduce variation within the data, the Bulgarian data can be cleaned before word segmentation by converting the Latin letters in Bulgarian text into the corresponding Cyrillic letters.
The code conversion resource file for Bulgarian needs to contain the Cyrillic letters of Bulgarian, the confusable Latin letters corresponding to them, and the correspondence between the two, and the Bulgarian data is converted according to this resource file.
In step 3), based on the analysis of the three languages in step 2), the code conversion file of each language is loaded and the data is code-converted, specifically:
301) Input the language data to be processed and the language tag;
302) Read the resource file corresponding to the language and load it into memory;
303) Traverse each character of the language data sentence by sentence and determine whether the current character needs code conversion;
304) If code conversion is needed, convert the character according to the conversion rules of the language and the corresponding resource file;
305) Output the code-converted sentence to step 4).
In step 303), if no code conversion is needed, continue with the next character until all characters in the data have been traversed, then go to step 305).
Taking Vietnamese, Persian, and Bulgarian as examples, the specific conversion methods are as follows:
The code conversion file loaded for Vietnamese contains the standard Unicode characters and the non-standard written forms, in one-to-one correspondence. The conversion steps are:
a. Input the data and the language tag. The tag for Vietnamese is vi, and the example sentence is:
[image: Vietnamese example sentence]
(Chinese translation: The United States withdraws from the Iran nuclear agreement)
b. Read the Vietnamese resource file and load its content into memory as a key-value dictionary named Vi_dict: each key is a non-standard combination of a character and a diacritic, and the corresponding value is the standard Unicode character;
c. Traverse each character in the sentence and determine whether the current character is a diacritic: if it is not, continue with the next character; if it is, go to step d. After the whole sentence has been traversed, go to step e;
d. If the current character is a diacritic, combine the previous character with the current character and check whether the combination exists in Vi_dict: if it does, replace the combination with the corresponding standard character and return to step c to continue with the next character; if the combination does not exist in Vi_dict, return directly to step c. The combinations to be converted in the example sentence are [character image] and [character image]; these combinations are found in the dictionary Vi_dict and are replaced by the corresponding standard Unicode characters;
e. Return the code-converted sentence: [image: converted Vietnamese sentence].
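Steps a to e for Vietnamese can be sketched as follows. The dictionary below plays the role of Vi_dict, but its three entries are hypothetical samples chosen for the illustration, and detecting a diacritic with `unicodedata.combining` is an assumption about how "is the current character a diacritic" might be implemented.

```python
import unicodedata
from typing import Dict

# Hypothetical sample of Vi_dict: non-standard combination -> standard character.
VI_DICT: Dict[str, str] = {
    "\u00C2\u0323": "\u1EAC",  # A-circumflex + dot below  -> U+1EAC
    "\u00EA\u0301": "\u1EBF",  # e-circumflex + acute      -> U+1EBF
    "a\u0300": "\u00E0",       # a + grave                 -> U+00E0
}


def convert_vietnamese(sentence: str, vi_dict: Dict[str, str] = VI_DICT) -> str:
    """Replace known base+diacritic pairs with their single standard character."""
    out = []
    for ch in sentence:
        # Step c: a nonzero combining class marks the character as a diacritic.
        if out and unicodedata.combining(ch):
            pair = out[-1] + ch
            # Step d: replace the pair only if the dictionary knows it.
            if pair in vi_dict:
                out[-1] = vi_dict[pair]
                continue
        out.append(ch)
    # Step e: return the converted sentence.
    return "".join(out)
```

Pairs that are not in the dictionary are left untouched, as in step d. The sketch only handles one diacritic per base character; a letter written as a base plus two separate diacritics would need either longer keys or repeated lookups.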
The resource file for Persian contains Arabic characters and Arabic Presentation Forms-B characters. The conversion steps are:
a. Input the data and the language tag. The tag for Persian is fa, and the example sentence is:
[image: Persian example sentence]
(Chinese translation: The day on which the sky shakes violently)
b. Read the Persian resource file. A Persian word is generally built from 3 root letters, and new words can be formed by adding prefixes or suffixes, by changing phonemes inside the word, or by inserting other phonemes. In the resource file, each Arabic character therefore corresponds to a group of four Arabic Presentation Forms-B characters, and the correspondences are stored in the structure Fa_map;
c. Traverse each character of the Persian data sentence by sentence and determine whether it lies in the Arabic Presentation Forms-B range of Unicode. If the current character is not a Presentation Forms-B character, continue with the next character; if it is, go to step d. After the whole sentence has been traversed, go to step e;
d. Look the current character up in Fa_map. If it is not present, return to step c and continue with the next character; if it is present, convert it into the corresponding Arabic character and then return to step c;
e. Return the code-converted sentence: [image: converted Persian sentence].
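Steps a to e for Persian can be sketched as follows. As an assumption made for the example, the map that plays the role of Fa_map is derived from the compatibility decompositions in Python's `unicodedata` database instead of being read from the patent's resource file, and it is stored in the lookup direction (Presentation Forms-B character to ordinary Arabic character).

```python
import unicodedata
from typing import Dict

FORMS_B_START, FORMS_B_END = 0xFE70, 0xFEFF  # Arabic Presentation Forms-B


def build_fa_map() -> Dict[str, str]:
    """Map each Forms-B character to the ordinary Arabic text it stands for."""
    fa_map: Dict[str, str] = {}
    for cp in range(FORMS_B_START, FORMS_B_END + 1):
        decomposition = unicodedata.decomposition(chr(cp))
        if not decomposition:
            continue  # unassigned code point or no decomposition (e.g. U+FEFF)
        parts = decomposition.split()
        if parts[0].startswith("<"):
            parts = parts[1:]  # drop the tag such as <initial> or <final>
        fa_map[chr(cp)] = "".join(chr(int(p, 16)) for p in parts)
    return fa_map


def convert_persian(sentence: str, fa_map: Dict[str, str]) -> str:
    """Replace Forms-B characters with ordinary Arabic characters."""
    out = []
    for ch in sentence:
        # Step c: only characters inside the Forms-B range are candidates.
        if FORMS_B_START <= ord(ch) <= FORMS_B_END and ch in fa_map:
            out.append(fa_map[ch])  # step d: convert via the map
        else:
            out.append(ch)
    return "".join(out)


FA_MAP = build_fa_map()
```

With this map, the isolated, final, initial, and medial forms of one letter all convert to the same ordinary Arabic letter, which is the unification the text describes. Some Forms-B characters are ligatures and expand to more than one ordinary character.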
Code conversion for Bulgarian converts the Latin letters in Bulgarian data into Cyrillic letters, so the resource file contains the correspondence between Latin letters and the Cyrillic letters of Bulgarian. The conversion steps are:
a. Input the data and the language tag. The tag for Bulgarian is bg, and the example sentence is:
“Toйkaзa:"He mиcлr,чe иma пpoблem,kaпитaлoвиrт пaзap e mнoгo kooпepaтивeн.”
(Chinese translation: He said, "I think there is no problem; the capital market is very cooperative.")
b. Load the Bulgarian resource file, and load the Cyrillic letters of Bulgarian and the Latin letters corresponding to them into the dictionary Bg_dict as key-value pairs, where the Latin letters are the keys and the Cyrillic letters are the values;
c. Traverse each character of the Bulgarian data sentence by sentence and determine whether the current character exists in the dictionary Bg_dict: if not, continue with the next character; if so, replace the character with its value in Bg_dict, that is, replace the current Latin character with the corresponding Cyrillic character. In the Bulgarian words of the example sentence, "m", "k", "T", "a", "o", and "r" are all Latin letters, which need to be converted according to Bg_dict into the corresponding Cyrillic letters "м", "к", "Т", "а", "о", and "я".
d. Return the code-converted example sentence: “Тойказа:"Не мисля,че има проблем,капиталовият пазар е много кооперативен.”
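Steps a to d for Bulgarian can be sketched as follows. The tab-separated line format of the resource file and the six sample entries are assumptions made for the illustration; only the key and value roles (Latin letter as key, Cyrillic letter as value) come from the text. The conversion uses `str.translate`, which applies the same per-character lookup as the traversal in step c.

```python
from typing import Dict, Iterable

# Hypothetical resource lines: "<Latin letter>\t<Cyrillic letter>".
RESOURCE_LINES = [
    "m\t\u043C",  # Latin m -> Cyrillic em
    "k\t\u043A",  # Latin k -> Cyrillic ka
    "T\t\u0422",  # Latin T -> Cyrillic te
    "a\t\u0430",  # Latin a -> Cyrillic a
    "o\t\u043E",  # Latin o -> Cyrillic o
    "r\t\u044F",  # Latin r -> Cyrillic ya
]


def load_bg_dict(lines: Iterable[str]) -> Dict[str, str]:
    """Step b: build Bg_dict with Latin letters as keys, Cyrillic as values."""
    bg_dict: Dict[str, str] = {}
    for line in lines:
        latin, cyrillic = line.rstrip("\n").split("\t")
        bg_dict[latin] = cyrillic
    return bg_dict


def convert_bulgarian(sentence: str, bg_dict: Dict[str, str]) -> str:
    """Steps c-d: replace every character that is a key of Bg_dict."""
    return sentence.translate(str.maketrans(bg_dict))


BG_DICT = load_bg_dict(RESOURCE_LINES)
```

One consequence of a plain per-character mapping is that genuine Latin words embedded in Bulgarian text (names, URLs) would be converted as well; whether the patented method guards against that is not stated in the text.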
The multilingual word segmentation method based on code conversion analyzes and converts data according to the usage characteristics of each language, so that a user can segment multiple languages with the same word segmentation method. The method is flexible and simple, can be embedded easily into a word segmentation pipeline, and switches easily between languages. It supports conversion for multiple languages at once, ensures consistent encoding for languages that have several encodings, improves data quality, reduces data sparsity in neural machine translation training, and improves the quality of neural machine translation.

Claims (6)

1. A multi-lingual word segmentation method based on code conversion is characterized by comprising the following steps:
1) Data preprocessing: inputting data and a language label of a word to be segmented, filtering redundant spaces in the data and adjusting the data into a UTF-8 encoding format;
2) Loading a code conversion file: loading a code conversion resource file of a corresponding language according to the language tag input in the step 1);
3) And (3) code conversion: performing code conversion on the data by using the code conversion resource file loaded in the step 2);
4) Word segmentation: performing word segmentation on the data subjected to code conversion by using symbols such as punctuations, spaces and the like;
in step 3), loading the transcoding files of each language, and performing transcoding on data, specifically:
301 Input language data and language tags to be processed;
302 Read the resource file corresponding to the language and load the resource file into the memory;
303 Traverse each character in the linguistic data by sentence, judge whether the present character needs code conversion;
304 If code conversion is needed, converting the characters needed to be converted according to the conversion rules of each language and the corresponding resource files;
305 Output the transcoded sentence to step 4).
2. The transcoding-based multinational language segmentation method of claim 1, wherein: in step 2), the code conversion resource file specifically analyzes different characteristics of each language, distinguishes writing modes and using habits of each language, and loads corresponding code conversion files for processing according to coding intervals and conversion requirements of each language and by using characteristics of each language.
3. The transcoding-based multilingual word segmentation method of claim 2, wherein: for Vietnamese, the characters with syllables in the data have two writing modes of single characters and combined characters, coding conversion is carried out on the Vietnamese data before word segmentation, the characters in the non-standard coding mode in the data are converted into corresponding standard coding characters in a unified mode, and resource files of the two writing modes are loaded according to the corresponding relation of the standard coding and the non-standard coding of the same character.
4. The transcoding-based multilingual word segmentation method of claim 2, wherein: for Persian, most word characters in the data are ordinary Arabic characters while a few are coded as Arabic Presentation Forms-B characters; the conversion rule between the Arabic characters and the Arabic Presentation Forms-B characters used in Persian is loaded, the Presentation Forms-B characters in the Persian data are converted into ordinary Arabic characters, and subsequent word segmentation and machine translation training are then carried out, so that Persian text from multiple coding intervals is translated uniformly.
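A hedged sketch of the Persian conversion: rather than the resource file described in the claim, it relies on the compatibility decompositions that Unicode defines for the Arabic Presentation Forms-B block (U+FE70 to U+FEFC), applied through NFKC one character at a time so that text outside that block is left untouched. The function name is hypothetical.

```python
import unicodedata

def fold_presentation_forms_b(text):
    """Map Arabic Presentation Forms-B characters to ordinary Arabic letters."""
    out = []
    for ch in text:
        if "\uFE70" <= ch <= "\uFEFC":
            # Compatibility mapping, e.g. U+FEB3 (seen, initial form) -> U+0633
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)

# seen (initial form) + lam-alef ligature (final form) + meem (final form)
shaped = "\uFEB3\uFEFC\uFEE2"
plain = fold_presentation_forms_b(shaped)
print([hex(ord(ch)) for ch in plain])
```

A ligature such as lam-alef expands to two letters, so the output can be longer than the input; downstream steps should not assume a one-to-one character alignment.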
5. The transcoding-based multilingual word segmentation method of claim 2, wherein: for Bulgarian, which is written in Cyrillic letters and distinguishes upper and lower case, some letters look like Latin letters but have different codes, and Bulgarian data often contains a large number of Latin letters used in place of the corresponding Cyrillic letters; a resource file containing the Cyrillic letters used in Bulgarian, the confusable Latin letters corresponding to them, and the correspondence between the two is loaded, and the Bulgarian data is converted according to the resource file.
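The Bulgarian rule can be illustrated with a small look-alike table. The table below is a hypothetical, incomplete example of such a resource file, and the decision to repair only words that already contain at least one Cyrillic letter is an assumption made here so that genuine Latin-script words are left alone; the claim itself does not state that heuristic.

```python
import re

# Hypothetical excerpt of a Latin -> Cyrillic look-alike table.
LATIN_TO_CYRILLIC = {
    "A": "\u0410", "B": "\u0412", "E": "\u0415", "K": "\u041A", "M": "\u041C",
    "H": "\u041D", "O": "\u041E", "P": "\u0420", "C": "\u0421", "T": "\u0422",
    "X": "\u0425", "a": "\u0430", "e": "\u0435", "o": "\u043E", "p": "\u0440",
    "c": "\u0441", "x": "\u0445", "y": "\u0443",
}

def has_cyrillic(word):
    return any("\u0400" <= ch <= "\u04FF" for ch in word)

def fix_bulgarian(text):
    """Replace confusable Latin letters inside words that are otherwise Cyrillic."""
    def repair(match):
        word = match.group(0)
        if not has_cyrillic(word):
            return word  # assumed to be a genuine Latin-script word
        return "".join(LATIN_TO_CYRILLIC.get(ch, ch) for ch in word)
    return re.sub(r"\w+", repair, text)

# First word starts with Latin 'c' followed by three Cyrillic letters.
mixed = "c\u0432\u044F\u0442 Google"
print(fix_bulgarian(mixed) == "\u0441\u0432\u044F\u0442 Google")
```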
6. The method of claim 1, wherein in step 303), if transcoding is not required, the next character is examined, until all characters in the data have been traversed, whereupon the method proceeds to step 305).
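The traversal in steps 301) to 305), including the skip described in claim 6, might look like the sketch below. The resource file format (one `source<TAB>target` pair per line, with `#` comment lines) is an assumption made for illustration; the patent text here does not specify the file layout, and both function names are hypothetical.

```python
def parse_resource(lines):
    """Step 302: build a conversion table from 'source<TAB>target' lines (assumed format)."""
    table = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        source, target = line.split("\t", 1)
        table[source] = target
    return table

def transcode_sentences(sentences, table):
    """Steps 303-305: walk every character of every sentence and convert where needed."""
    converted = []
    for sentence in sentences:
        chars = []
        for ch in sentence:
            if ch in table:          # step 303: does this character need conversion?
                chars.append(table[ch])  # step 304: apply the rule
            else:
                chars.append(ch)     # claim 6: nothing to do, move to the next character
        converted.append("".join(chars))  # step 305: hand the sentence to segmentation
    return converted

table = parse_resource(["# demo rule", "\uFE8D\t\u0627\n", ""])
print(transcode_sentences(["a\uFE8Db", "xyz"], table))
```

Because `parse_resource` accepts any iterable of lines, it can be fed an open file object (opened with `encoding="utf-8"`) as well as the in-memory list used here.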
CN201911324149.4A 2019-12-20 2019-12-20 Multi-lingual word segmentation method based on code conversion Active CN111178061B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911324149.4A CN111178061B (en) 2019-12-20 2019-12-20 Multi-lingual word segmentation method based on code conversion

Publications (2)

Publication Number Publication Date
CN111178061A (en) 2020-05-19
CN111178061B (en) 2023-03-10

Family

ID=70655529

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911324149.4A Active CN111178061B (en) 2019-12-20 2019-12-20 Multi-lingual word segmentation method based on code conversion

Country Status (1)

Country Link
CN (1) CN111178061B (en)

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Du Quan; Xu Ping
Inventor before: Du Quan; Xu Ping; Zhu Jingbo; Xiao Tong; Zhang Chunliang
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right
Denomination of invention: A Multi language Word Segmentation Method Based on Encoding Conversion
Granted publication date: 20230310
Pledgee: China Construction Bank Shenyang Hunnan sub branch
Pledgor: SHENYANG YAYI NETWORK TECHNOLOGY CO.,LTD.
Registration number: Y2024210000102