Background
Language is the medium through which humans think and communicate, and it is their most important communication tool. It arose and developed together with human society, so it inevitably influences politics, the economy, science and technology, and culture. At present, 5651 languages have been identified worldwide, distributed across different regions.
Based on shared features and genetic relationships in phonology, grammar and vocabulary, linguists divide the world's languages into language families. Each language family comprises multiple languages, these families and languages are concentrated in particular regions, and many cultural features are closely related to them.
A word is the smallest unit of language that can be used independently, and machine translation systems generally take the word as the basic unit of analysis. An effective, high-quality word segmentation module is therefore crucial to a machine translation system.
Every language has its own characteristics. In terms of how words are segmented, languages fall roughly into two types. The first comprises isolating or agglutinative languages such as Chinese and Japanese. The second comprises most Western languages, chiefly English, which are called inflected languages: words are delimited by spaces, so the spaces in a text mark the word boundaries, and a sentence can be split into a sequence of words by treating the space as the delimiter. For most Western languages, therefore, segmentation uses the space as the delimiter.
In the 19th century, European scholars studied nearly one hundred of the world's languages and found correspondences and similarities in the sounds, vocabulary and grammar of some of them. They grouped such languages into one class, the languages of the same group; where different groups in turn correspond to one another, they are regarded as languages of common origin. This is the genealogical relationship of languages. In the 20th century, linguists further classified the world's languages into language families, such as the Indo-European language family, the Sino-Tibetan language family, and others. However, although the languages of the world are divided into language families and each family has its own characteristics, languages within the same family still differ in many ways, and some languages have several encodings and writing styles, for example:
1) Vietnamese has two encodings for the same letter: one is a single standalone character, and the other is a combination of two characters.
2) Arabic script has several representations in Unicode, such as the Arabic block, Arabic Presentation Forms-A and Arabic Presentation Forms-B, and Persian data contains both plain Arabic characters and Arabic Presentation Forms-B characters at the same time.
3) Bulgarian belongs to the South Slavic branch of the Indo-European language family and is written in Cyrillic letters, but Bulgarian data is often mixed with a large number of Latin letters that need to be converted.
As these cases show, a single word segmentation method cannot accommodate the features of every language, and it is difficult to segment all languages in the same way. Yet there are so many languages that designing a dedicated segmentation method for each one is cumbersome and impractical. The different languages therefore need to be studied and analyzed, the data of each language preprocessed with a code conversion tailored to its features, and the result then segmented uniformly.
Unicode is a coding scheme created to overcome the limitations of traditional character set encodings. It assigns a unique code to every character of every language in order to support cross-language and cross-platform text conversion and processing. Because Unicode has a single character set, it avoids the ambiguity of double-byte character sets, and it is now widely used for information exchange worldwide. In Unicode, each character block, such as Greek, Cyrillic or Armenian, has its own code range under the same standard, and each character has its own code within that range. FIG. 1 shows the Unicode code ranges of some languages.
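The idea of per-block code ranges can be illustrated with a minimal sketch. The table `BLOCKS` and the function `block_of` are hypothetical names introduced here; the ranges are taken from the Unicode standard rather than from FIG. 1 and cover only a few blocks relevant to this document.

```python
# Hypothetical lookup table: (first code point, last code point, block name).
# Ranges follow the Unicode standard; only a handful of blocks are listed.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0530, 0x058F, "Armenian"),
    (0x0600, 0x06FF, "Arabic"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
    (0xFE70, 0xFEFF, "Arabic Presentation Forms-B"),
]


def block_of(ch):
    """Return the name of the listed block containing ch, or 'other'."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return "other"
```

A lookup of this kind is enough to tell, for instance, that a Latin letter has strayed into Cyrillic text or that a character comes from a presentation-forms block.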
Each of the three languages mentioned above has its own code range, but noise is occasionally mixed into their data: in some languages one character is split into a combination of two characters, as in Vietnamese; in some, the characters of one language come from two different code ranges, as in Persian; and in some, the data contains characters from other code ranges, as in Bulgarian.
A multilingual word segmentation method based on code conversion can unify the different encodings of the same language and bring together its different representations and writing styles. This effectively reduces the vocabulary size of the training data, alleviates data sparsity, improves the quality of word segmentation in machine translation, and thereby improves the quality of the translated text.
At present, no multilingual word segmentation method based on code conversion that meets these requirements has been reported.
Disclosure of Invention
Existing multilingual word segmentation methods mainly split on spaces and punctuation, so they struggle to segment multiple languages with multiple encodings at the same time and cannot produce high-quality segmentation results. To address these shortcomings, the invention provides a word segmentation method based on code conversion that supports conversion between the multiple code ranges used by multiple languages.
To solve the above technical problem, the invention adopts the following technical solution:
The invention relates to a multilingual word segmentation method based on code conversion, comprising the following steps:
1) Data preprocessing: input the data to be segmented and a language tag, filter redundant spaces from the data, and convert the data to UTF-8 encoding;
2) Loading the code conversion file: load the code conversion resource file of the corresponding language according to the language tag input in step 1);
3) Code conversion: perform code conversion on the data using the code conversion resource file loaded in step 2);
4) Word segmentation: segment the code-converted data using symbols such as punctuation marks and spaces.
In step 2), the code conversion resource files are prepared by analyzing the characteristics of each language and distinguishing its writing styles and usage habits; the code conversion file matching each language's code ranges and conversion requirements is then loaded for processing.
For Vietnamese, letters with diacritics appear in the data in two written forms, as single characters and as combined characters. Vietnamese data is code-converted before word segmentation so that characters in the non-standard encoding are uniformly converted into the corresponding standard characters, and a resource file covering both written forms is loaded according to the correspondence between the standard and non-standard encodings of the same character.
For Persian, most words in the data consist of plain Arabic characters, while a few words consist of characters encoded in Arabic Presentation Forms-B. The conversion rules between the Arabic characters and the Arabic Presentation Forms-B characters used in Persian are loaded, the Presentation Forms-B characters in the Persian data are converted into plain Arabic characters, and word segmentation and machine translation training are then performed, so that Persian text in multiple code ranges can be translated together.
Bulgarian is written in Cyrillic letters and distinguishes upper and lower case; some letters look like Latin letters but have different codes, and Bulgarian data is mixed with a large number of Latin letters used in place of Bulgarian Cyrillic letters. The resource file therefore contains the Bulgarian Cyrillic letters, the confusable Latin letters corresponding to them, and the correspondence between the two, and the Bulgarian data is converted according to this resource file.
In step 3), the code conversion file of each language is loaded and the data is code-converted, specifically:
301) Input the language data to be processed and the language tag;
302) Read the resource file corresponding to the language and load it into memory;
303) Traverse the characters of the language data sentence by sentence and judge whether the current character needs code conversion;
304) If code conversion is needed, convert the character according to the conversion rules of the language and the corresponding resource file;
305) Output the code-converted sentence to step 4).
In step 303), if no code conversion is needed, continue with the next character until all characters in the data have been traversed, then go to step 305).
The invention has the following beneficial effects and advantages:
1. The multilingual word segmentation method based on code conversion accommodates the different encoding characteristics of multiple languages at the same time. It analyzes and code-converts each language according to its own characteristics, so that a single segmentation method can segment multiple languages.
2. By analyzing the characteristics of different languages and code-converting multilingual data, the method resolves the problem of one language's data having several encodings and also filters out wrongly encoded data, thereby improving the quality of the multilingual data.
3. The code conversion method of the invention effectively reduces data sparsity and improves data quality, and it reduces the vocabulary size in subsequent machine translation training, thereby effectively improving translation quality.
Detailed Description
The invention is further elucidated with reference to the accompanying drawings.
The invention applies Unicode code conversion to multilingual data in which one language has several encoded representations, standardizing the encodings so that the word segmentation system can process them easily. Every language in the Unicode code set has its own code range, which effectively avoids character set ambiguity; the code ranges of some languages are shown in FIG. 1.
FIG. 3 is the overall flow chart of the multilingual word segmentation method based on code conversion. The method of the invention specifically comprises the following steps:
1) Data preprocessing: input the data to be segmented and a language tag, filter redundant spaces from the data, and convert the data to UTF-8 encoding;
2) Loading the code conversion file: load the code conversion resource file of the corresponding language according to the language tag input in step 1);
3) Code conversion: perform code conversion on the data using the code conversion resource file loaded in step 2);
4) Word segmentation: segment the code-converted data using symbols such as punctuation marks and spaces.
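Step 4) can be sketched as follows. This is only an illustration under assumptions: the source does not say whether punctuation marks are kept as tokens, so the hypothetical `segment` function below keeps each punctuation mark as a token of its own and uses spaces purely as boundaries.

```python
import re

# A token is either a run of word characters or a single symbol that is
# neither a word character nor whitespace (for example a punctuation mark).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def segment(sentence):
    """Split an already code-converted sentence on spaces and punctuation."""
    return TOKEN_RE.findall(sentence)
```

Note that Python's `\w` does not match combining marks, which is one reason the code conversion of step 3) has to run before this step.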
In step 1), each language has its own language tag, which consists of 2 to 3 English letters and identifies the language. The language tag is input into the word segmentation system so that the system can identify the language and carry out the subsequent code conversion; some language tags are shown in FIG. 2.
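A minimal sketch of the preprocessing in step 1) might look like this. The function names and the `source_encoding` parameter are assumptions made for illustration; the source does not describe how the original encoding of the input is determined.

```python
import re

# A language tag is 2 to 3 English letters, e.g. "vi", "fa", "bg".
TAG_RE = re.compile(r"[A-Za-z]{2,3}")


def is_language_tag(tag):
    """Check that the tag consists of 2 to 3 English letters."""
    return TAG_RE.fullmatch(tag) is not None


def preprocess(raw, source_encoding="utf-8"):
    """Decode raw bytes, collapse redundant whitespace, re-encode as UTF-8."""
    text = raw.decode(source_encoding)
    text = " ".join(text.split())  # drops leading, trailing and repeated spaces
    return text.encode("utf-8")
```

For example, `preprocess("  a   b \n".encode("utf-16"), "utf-16")` yields the UTF-8 bytes `b"a b"`.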
In step 2), the code conversion resource files are prepared by analyzing the characteristics of each language and distinguishing its writing styles and usage habits; the code conversion file matching each language's code ranges and conversion requirements is then loaded for processing.
Taking Vietnamese, Persian and Bulgarian as examples, the characteristics of the three languages are as follows:
Step 201) Vietnamese is a tonal language that uses tones to distinguish word meanings and is written in Latin letters, which are divided into letters with diacritics and letters without. A standard letter with diacritics is a single character, but it can also be written by combining a plain character with a diacritical mark, so letters with diacritics appear in Vietnamese data in two written forms.
Take the letter "Ậ" as an example. "Ậ" is a single character that belongs to the Latin Extended Additional block of the Unicode code set, and its standard Unicode code is 1EAC. Vietnamese data also contains a second written form, in which the letter is written as a base character combined with a diacritical mark, which amounts to two characters in the Unicode code set. The two forms are indistinguishable to someone reading Vietnamese, but when a word segmentation system segments the data it may split the non-standard character into two characters, which changes the content of the original data, destroys its meaning and lowers the data quality. Therefore, before word segmentation, Vietnamese data must be code-converted so that characters in the non-standard encoding are uniformly converted into the corresponding standard characters.
Vietnamese has 195 letters in total, most of which are letters with diacritics that can also be written as a plain character combined with a diacritical mark. Vietnamese therefore requires a resource file covering both written forms, built from the correspondence between the standard and non-standard encodings of the same character.
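The effect of the two written forms on segmentation can be reproduced with a short sketch. The word "Việt" and the function `naive_segment` are illustrative choices, not taken from the source; the combined form is produced here with Python's `unicodedata.normalize`.

```python
import re
import unicodedata

# "Việt" written with the single standard character U+1EC7 (4 characters).
single = "Vi\u1ec7t"
# The same word with the letter written as a base character plus
# combining marks (6 characters); it renders identically on screen.
combined = unicodedata.normalize("NFD", single)


def naive_segment(text):
    """Segment on anything that is not a word character."""
    return re.findall(r"\w+", text)
```

With this segmenter the single-character form stays one word, while the combined form is cut into "Vie" and "t" because combining marks are not word characters.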
Step 202) Persian is written with 28 Arabic letters and 4 newly created Persian letters, and the language contains many Arabic loanwords. The Unicode code set has three blocks containing Arabic letters: Arabic, Arabic Presentation Forms-A and Arabic Presentation Forms-B. The Arabic block occupies the range 0x0600 to 0x06FF. Because of the writing rules of Arabic script, the same letter takes different shapes at different positions in a word, so Unicode also defines the Arabic Presentation Forms-A and Arabic Presentation Forms-B blocks to encode these other representations of the Arabic letters. Arabic Presentation Forms-B, whose range is 0xFE70 to 0xFEFF, defines the positional shape variants and display characters used in Persian.
In existing Persian data, most words consist of plain Arabic characters and a few consist of Arabic Presentation Forms-B characters. Because the two kinds of Persian data come from different code ranges, training machine translation on both causes data sparsity and an oversized vocabulary, which harm translation quality; moreover, Persian data in the Presentation Forms-B range is scarce, and its small share of the training data also affects the performance of the machine translation system. Yet for the sake of communication among peoples and languages, and to protect human linguistic culture, Persian encoded in Arabic Presentation Forms-B cannot simply be discarded. It is therefore necessary to load the conversion rules between the Arabic characters and the Arabic Presentation Forms-B characters used in Persian, convert the Presentation Forms-B characters in the Persian data into plain Arabic characters, and then perform word segmentation and machine translation training, so that Persian text in multiple code ranges can be translated together.
Step 203) Bulgarian developed through three stages: Old Bulgarian, Middle Bulgarian and Modern Bulgarian. Modern Bulgarian is written with 30 Cyrillic letters and distinguishes upper and lower case; some letters look like Latin letters but are not the same characters.
For example, the Bulgarian letter "В" and the Latin letter "B" look identical in writing, but the Unicode code of the Bulgarian letter "В" is 0x412 while that of the Latin letter "B" is 0x42. They are two completely different characters, and Bulgarian has many such letters.
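The difference between the two look-alike letters is easy to check programmatically. In this small sketch the helper `describe` is a hypothetical name; it only reports the code point and the official Unicode character name.

```python
import unicodedata

cyrillic_ve = "\u0412"  # Bulgarian (Cyrillic) capital letter, looks like Latin B
latin_b = "\u0042"      # Latin capital letter B


def describe(ch):
    """Return the code point and Unicode name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"
```

Here `describe(cyrillic_ve)` gives "U+0412 CYRILLIC CAPITAL LETTER VE" and `describe(latin_b)` gives "U+0042 LATIN CAPITAL LETTER B", so a string comparison treats the two as unrelated characters.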
Machine translation training data comes from many sources, and in the massive Bulgarian data crawled from the internet a large number of Latin letters are used in place of Bulgarian Cyrillic letters. To retain genuine Bulgarian data and reduce inconsistency in the data, the Bulgarian data can be cleaned before word segmentation by converting the Latin letters in it into the corresponding Cyrillic letters.
The code conversion resource file for Bulgarian must contain the Bulgarian Cyrillic letters, the confusable Latin letters corresponding to them, and the correspondence between the two; the Bulgarian data is converted according to this resource file.
In step 3), based on the analysis of the three languages in step 2), the code conversion file of each language is loaded and the data is code-converted, specifically:
301) Input the language data to be processed and the language tag;
302) Read the resource file corresponding to the language and load it into memory;
303) Traverse the characters of the language data sentence by sentence and judge whether the current character needs code conversion;
304) If code conversion is needed, convert the character according to the conversion rules of the language and the corresponding resource file;
305) Output the code-converted sentence to step 4).
In step 303), if no code conversion is needed, continue with the next character until all characters in the data have been traversed, then go to step 305).
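Steps 301) to 305) can be sketched generically as below. The resource file format is an assumption: the source does not specify it, so this sketch supposes one tab-separated "source, target" pair per line, and the names `load_table` and `transcode` are hypothetical.

```python
def load_table(lines):
    """Step 302: parse 'source<TAB>target' lines into an in-memory dict.

    The tab-separated layout is an assumed format, not the source's.
    """
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        source, target = line.split("\t")
        table[source] = target
    return table


def transcode(sentence, table):
    """Steps 303 to 305: convert each character that has an entry."""
    out = []
    for ch in sentence:                # step 303: traverse the characters
        out.append(table.get(ch, ch))  # step 304: convert, or keep as-is
    return "".join(out)                # step 305: hand the sentence on
```

A file could be loaded with something like `load_table(open(tag + ".tsv", encoding="utf-8"))`, where the file name is again only an assumption. This loop handles one-character keys; languages whose keys span two characters need to look back at the previous character.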
Taking Vietnamese, Persian and Bulgarian as examples, the specific conversion methods are as follows:
The code conversion file loaded for Vietnamese contains the standard Unicode characters and the non-standard written forms, which correspond one to one. The specific conversion steps are as follows:
a. Input the data and the language tag; the tag for Vietnamese is vi, and the example sentence is as follows:
(Translation: The United States withdraws from the Iran nuclear agreement)
b. Read the Vietnamese resource file and load its content into memory as a key-value dictionary named Vi_dict: each key is a non-standard combination of a base character and a diacritical mark, and the corresponding value is the standard Unicode character;
c. Traverse each character in the sentence and judge whether the current character is a diacritical mark: if not, continue with the next character; if so, go to step d. After the whole sentence has been traversed, go to step e;
d. If the current character is a diacritical mark, combine the previous character with it and check whether the combination exists in Vi_dict: if it does, replace the combination with the corresponding standard character and return to step c to traverse the next character; if it does not, return directly to step c. The example sentence contains two such combinations that need conversion; both are found in the dictionary Vi_dict and are replaced by the corresponding standard Unicode characters;
e. Return the code-converted sentence.
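Steps b to e for Vietnamese can be sketched as follows. The source loads Vi_dict from a resource file; as a stand-in assumption, this sketch derives the dictionary from the canonical decompositions in Python's `unicodedata` module, restricted to Latin ranges. The names `build_vi_dict` and `convert_vietnamese` are hypothetical.

```python
import unicodedata


def build_vi_dict():
    """Map 'character + combining mark' pairs to their single-character form.

    Stand-in for the resource file: entries come from Unicode canonical
    decompositions in the Latin ranges that cover Vietnamese letters.
    """
    vi_dict = {}
    for first, last in [(0x00C0, 0x024F), (0x1E00, 0x1EFF)]:
        for cp in range(first, last + 1):
            decomp = unicodedata.decomposition(chr(cp))
            if not decomp or decomp.startswith("<"):
                continue  # no canonical decomposition for this code point
            parts = [chr(int(p, 16)) for p in decomp.split()]
            if len(parts) == 2 and unicodedata.combining(parts[1]):
                vi_dict[parts[0] + parts[1]] = chr(cp)
    return vi_dict


VI_DICT = build_vi_dict()


def convert_vietnamese(sentence, vi_dict=VI_DICT):
    out = []
    for ch in sentence:                        # step c: traverse characters
        if unicodedata.combining(ch) and out:  # current char is a diacritic
            pair = out[-1] + ch                # step d: previous + current
            if pair in vi_dict:
                out[-1] = vi_dict[pair]        # replace with standard char
                continue
        out.append(ch)                         # otherwise keep unchanged
    return "".join(out)                        # step e: return the sentence
```

Because the previous output character is reused as the left half of the next pair, a base letter followed by two marks is merged in two passes of step d. For general text, Python's built-in `unicodedata.normalize("NFC", text)` performs a comparable composition and also handles marks that arrive in a different order, which this sketch does not.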
The resource file for Persian contains the Arabic characters and the Arabic Presentation Forms-B characters. The specific conversion steps are as follows:
a. Input the data and the language tag; the tag for Persian is fa, and the example sentence is as follows:
(Translation: the day that shakes violently in the sky)
b. Read the Persian resource file. A Persian word is generally built on 3 root letters, and new words are formed by adding prefixes or suffixes, changing the phonemes inside the word or inserting other phonemes. In the resource file, each Arabic character therefore corresponds to a group of four Arabic Presentation Forms-B characters, and these correspondences are stored in the structure Fa_map;
c. Traverse the characters of the Persian text sentence by sentence and judge whether each character lies in the Arabic Presentation Forms-B range of the Unicode code set: if the current character is not a Presentation Forms-B character, continue with the next character; if it is, go to step d. After the whole sentence has been traversed, go to step e;
d. Search Fa_map for the current character: if it is not found, return to step c and continue with the next character; if it is found, convert it into the corresponding Arabic character and then return to step c;
e. Return the code-converted sentence.
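The Persian steps can be sketched in the same spirit. Two assumptions are made: the table is derived from the compatibility decompositions in Python's `unicodedata` module instead of a resource file, and it is stored in the inverse direction of the Fa_map described above (Presentation Forms-B character to plain Arabic character) so that step d becomes a single lookup. The function names are hypothetical.

```python
import unicodedata

FORMS_B_FIRST, FORMS_B_LAST = 0xFE70, 0xFEFF


def build_fa_map():
    """Map each Presentation Forms-B character to its plain Arabic form."""
    fa_map = {}
    for cp in range(FORMS_B_FIRST, FORMS_B_LAST + 1):
        decomp = unicodedata.decomposition(chr(cp))
        if not decomp:
            continue  # unassigned code point, or no mapping defined
        parts = decomp.split()
        if parts[0].startswith("<"):
            parts = parts[1:]  # drop the tag, e.g. <initial> or <final>
        fa_map[chr(cp)] = "".join(chr(int(p, 16)) for p in parts)
    return fa_map


FA_MAP = build_fa_map()


def convert_persian(sentence, fa_map=FA_MAP):
    out = []
    for ch in sentence:                                   # step c
        in_forms_b = FORMS_B_FIRST <= ord(ch) <= FORMS_B_LAST
        if in_forms_b and ch in fa_map:                   # step d
            out.append(fa_map[ch])
        else:
            out.append(ch)
    return "".join(out)                                   # step e
```

All four positional shapes of a letter map to the same plain character, so a word typed with presentation forms and the same word typed with plain Arabic letters become identical strings after conversion. Characters in Arabic Presentation Forms-A are left untouched by this sketch.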
Code conversion for Bulgarian converts the Latin letters in Bulgarian data into Cyrillic letters, so the resource file contains the correspondence between Latin letters and Bulgarian Cyrillic letters. The specific conversion steps are as follows:
a. Input the data and the language tag; the tag for Bulgarian is bg, and the example sentence is as follows:
“Toйkaзa:"He mиcлr,чe иma пpoблem,kaпитaлoвиrт пaзap e mнoгo kooпepaтивeн.”
(Translation: He said, "I think there is no problem, and the capital market is very cooperative.")
b. Load the Bulgarian resource file, and load the Bulgarian Cyrillic letters and their corresponding Latin letters into the dictionary Bg_dict as key-value pairs, where the Latin letters are the keys and the Cyrillic letters are the values;
c. Traverse the characters of the Bulgarian text sentence by sentence and judge whether the current character exists in the dictionary Bg_dict: if not, continue with the next character; if so, replace the character with its value in Bg_dict, that is, replace the current Latin letter with the corresponding Cyrillic letter. In the Bulgarian words of the example sentence, "m", "k", "T", "a", "o" and "r" are all Latin letters, which need to be converted according to Bg_dict into the corresponding Cyrillic letters "м", "к", "Т", "а", "о" and "я".
d. The example sentence after code conversion is “Тойказа:"Не мисля,че има проблем,капиталовият пазар е много кооперативен.”, and it is returned.
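The Bulgarian steps reduce to a dictionary lookup per character, sketched below. Only the first six entries of `BG_DICT` come from the pairs named in step c; the remaining four are assumptions added so that short fragments of the example sentence convert fully, and a real resource file would list every confusable letter. Code points are written as escapes because the letters are visually indistinguishable.

```python
BG_DICT = {
    # Pairs named in step c (Latin key, Cyrillic value).
    "m": "\u043c",
    "k": "\u043a",
    "T": "\u0422",
    "a": "\u0430",
    "o": "\u043e",
    "r": "\u044f",
    # Assumed additional look-alike pairs, not listed in the source.
    "H": "\u041d",
    "e": "\u0435",
    "c": "\u0441",
    "p": "\u0440",
}


def convert_bulgarian(sentence, bg_dict=BG_DICT):
    """Step c: replace every character found in the dictionary."""
    return "".join(bg_dict.get(ch, ch) for ch in sentence)
```

For instance, `convert_bulgarian("To\u0439ka\u0437a")` returns a string made only of Cyrillic code points. The replacement is unconditional, so genuine Latin-script words embedded in Bulgarian text would be converted as well; whether that is acceptable depends on the data.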
The multilingual word segmentation method based on code conversion analyzes and converts data according to the usage characteristics of each language, so that a user can segment the languages of different countries with the same segmentation method. The method is flexible and simple, can be embedded easily into the word segmentation process, and switches conveniently between languages. It supports the conversion of multiple languages at the same time, ensures consistent encoding for languages that have several encodings, effectively improves data quality, reduces data sparsity during neural machine translation training, and improves the quality of neural machine translation.