
Morpheme-Based Neural Machine Translation Models for Low-Resource Fusion Languages

Published: 22 September 2023

Abstract

Neural approaches, which are currently state-of-the-art in many areas, have contributed significantly to the exciting advancements in machine translation. However, Neural Machine Translation (NMT) requires a substantial quantity of good-quality parallel training data to train the best models. A large amount of training data, in turn, increases the underlying vocabulary exponentially. Therefore, several methods have been devised to work with a relatively limited vocabulary, owing to constraints of computing resources such as system memory. Encoding words as sequences of subword units for so-called open-vocabulary translation is an effective strategy for solving this problem. However, the conventional methods for splitting words into subwords rely on statistics-based approaches that mainly suit agglutinative languages, in which morphemes have relatively clean boundaries. Their applicability to fusion languages, the main focus of this article, still needs to be thoroughly investigated. In fusion languages, phonological and orthographic processes alter the borders of the constituent morphemes of a word. This makes it difficult to distinguish the actual morphemes that carry syntactic or semantic information from the word’s surface form, the form of the word as it appears in the text. We, therefore, resorted to a word segmentation method that segments words by restoring the altered morphemes. We also compared conventional and morpheme-based NMT subword models and show that morpheme-based models outperform conventional subword models on a benchmark dataset.

1 Introduction

Machine translation is challenging because of language differences such as morphological variations [23]. Categorizing languages for cross-linguistic comparison is also difficult [39]. One way to make such a comparison is by assessing the dimensions of morphological typology. According to Jurafsky and Martin [42], morphological typology can vary in two dimensions: the first dimension ranges from isolating to polysynthetic, and the second dimension ranges from agglutinative to fusional. The first dimension relates to the number of morphemes per word. In isolating morphology, words typically consist of only one morpheme, while in polysynthetic morphology, words have multiple morphemes. The second dimension has to do with how segmentable morphemes are. It encompasses morpheme boundaries that are generally clear in agglutinative morphology and morpheme boundaries that are hazy in fusional morphology. The dimensions can be exemplified by Vietnamese (isolating), Siberian Yupik (polysynthetic), Turkish (agglutinative), and Amharic (fusional).
Different approaches have been used up to this point to automate the intricate task of translation. The initial attempts used rule-based systems to translate text from a source language into a target language. However, developing rule-based systems is time-consuming and costly because it is difficult to codify all the essential linguistic knowledge for accurate translation with hand-crafted rules. Such systems also have limited scalability and require considerable linguistic knowledge and resources that might not be available for low-resource languages [38]. Therefore, alternative data-driven strategies emerged as parallel corpora became more widely accessible. These methods benefit from the accurate translations produced by human translators: they rely on machine learning over curated parallel training data, or parallel corpora, to create translation models.
The two most well-known data-driven approaches are Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). NMT has surpassed SMT in recent years [4, 69, 73, 77]. Its success is attributed to several characteristics. First, unlike SMT, all components of an NMT system can be jointly tuned to optimize translation performance. Second, it processes complete sentences rather than just words or n-grams as SMT does. Third, it handles syntactic and semantic differences between languages better than SMT [7, 13]. Finally, it produces more fluent translations than SMT [47]. Nevertheless, the amount and quality of parallel training data significantly affect the performance of NMT models [36]. Very few of the approximately 7,000 languages spoken today have training data of the amount and quality required for NMT.
Additionally, because NMT only works with a fixed vocabulary due to the constraints of computing resources such as the GPU’s dedicated memory, it has trouble handling rare and out-of-vocabulary words in texts [68]. The issue is exacerbated when a word has multiple morphemes, as in synthetic languages such as Amharic and Turkish. In these languages, a single word may have thousands of different inflections, and a language’s lexicon may number in the hundreds of thousands or millions. For instance, in Amharic, the official language of Ethiopia, the word // meaning “to be” has roughly five thousand inflections in the 22 million tokens of the Contemporary Amharic Corpus (CACO) [33]. In Amharic, a space-delimited word may represent a phrase, clause, or sentence. For example, the word // meaning “until he explains it to them” is a clause. This word does not appear even once in the CACO corpus. Nevertheless, its constituent morphemes — //-//-//-//-// — appear several times in the corpus as parts of other words. Hence, the vocabulary of the language is too large for NMT.
Several methods have been proposed to work with a relatively small vocabulary because NMT is computationally resource-intensive and requires vast quantities of parallel training data. Segmenting words into sequences of subword units for so-called open-vocabulary translation is one effective way to address this issue [68]. In this way, the models can be trained on all words, which makes effective use of the limited training data. Furthermore, because some of the unknown words are only variations of words already included in the training data, the method also somewhat alleviates the issue of out-of-vocabulary words. However, the most conventional word segmentation methods commonly used in NMT [55], such as Byte Pair Encoding (BPE) [68], Word-Piece [67, 78], and Sentence Piece with Unigram Language Modeling (SPULM) [48], conform to agglutinative morphology. They depend on statistics-based methods for splitting words into subword units. The boundaries between morphemes, or meaningful word components, are generally clear in agglutinative morphology. The suitability of these methods for fusional morphology needs to be examined. The borders of constituent morphemes in fusion languages are altered by phonological and orthographic processes, making it difficult to separate the actual morphemes that carry syntactic or semantic information from written words, or surface forms. For instance, in the word unreasonably, the morphemes un and reason are simply concatenated, or agglutinated. The subword ably, however, fuses two morphemes, able and ly; it thus exhibits fusional morphology. Table 1 demonstrates the segmentation of unreasonably using different segmentation methods. All methods either under-segment or over-segment it. Therefore, we resorted to a morphological word segmentation method that segments words by restoring the actual morphemes. We also compared conventional and morpheme-based NMT subword models in an evaluation study on a benchmark dataset, the Amharic-English parallel corpus.
Table 1. Segmentation of the English Word Unreasonably Using Different Methods

Method       Segmentation
BPE          un-reason-ably
SPULM        un-re-as-on-ab-ly
Word-Piece   un-reas-ona-bl-y
The remainder of this article is organized as follows. Section 2 discusses related work. Section 3 explains the proposed word segmentation method. To apply and evaluate the method, we developed a baseline NMT system; Section 4 compares conventional and morpheme-based NMT subword models trained with it. Section 5 reports the results of the evaluation study. Section 6 gives conclusions and directions for future research.

2 Related Work

Although translation is an open-vocabulary problem, NMT models operate with a fixed vocabulary due to the limitations of computational resources. During the training of NMT models, the most frequent words, commonly between 30,000 and 80,000, are included in the vocabulary [4, 73]. Both words unseen at training time and less frequent (rare) words thus become out-of-vocabulary words. In practical NMT model training, a single special token represents them. This technique works well when there are only a few out-of-vocabulary words. However, translation performance degrades rapidly as the number of out-of-vocabulary words increases [4, 15]. The problem worsens for languages dominated by synthetic morphology, either agglutinative or fusional. These languages can have hundreds of thousands, if not millions, of words in their vocabulary, most of which become out-of-vocabulary words. The worst-case scenario is small training data for synthetic low-resource languages, which yields many out-of-vocabulary words during inference.
Another approach to the translation of out-of-vocabulary words is a back-off to a dictionary lookup [41, 52]. Nonetheless, this approach requires supplementary resources such as bilingual lexicons, which may not be readily available for low-resource languages. It also makes assumptions that do not always hold in reality, such as a one-to-one correspondence between source and target language words [68].
A feasible solution to make an NMT model capable of open-vocabulary translation is segmenting words as sequences of subword units [48, 67, 68, 78, 80]. The extreme case can be a character-level segmentation [16, 51]. However, compared with higher-level subwords, translating characters results in longer sequences, which is challenging for both modeling and computation [55].
In subword NMT models, most conventional word segmentation methods follow statistics-based approaches that use a data compression method [29] to reduce text entropy [70], an idea derived from information theory. We can view a text as a sequence of symbols (i.e., words or subwords), where each symbol is generated with a certain probability and carries a certain information content [9]. The higher the probability of a symbol, the lower its information content [37]. According to Mielke et al. [55], the most conventional word segmentation methods commonly used in NMT are BPE [68], Word-Piece [67, 78], and SPULM [48]. BPE iteratively replaces the most frequent pair of characters in a sequence with a single, unused character. A subword learner first decomposes the entire training text into single characters. Then, it induces a vocabulary by iteratively merging the most frequent adjacent pairs of characters or subwords until the desired subword vocabulary size is reached. Once the subword vocabulary is learned, a segmenter splits words in a text by greedily matching the longest available subword type. Word-Piece is analogous to BPE. However, while BPE uses co-occurrence frequency to select potential merges of subwords, Word-Piece relies on the likelihood of an n-gram language model trained on a version of the training text that contains the merged subwords. SPULM, on the other hand, is a fully probabilistic method based on a unigram language model. Unlike BPE or Word-Piece, SPULM builds the vocabulary using a top-down approach. It starts with a vast initial vocabulary containing all characters and the most frequent subword candidates in the training text. Then, it iteratively removes the subwords whose removal least degrades the overall probability. It is similar to Morfessor’s unsupervised segmentation [18], apart from Morfessor’s informed prior over subword length [11, 62].
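To make the BPE merge-learning loop concrete, the following is a minimal sketch of the procedure described above; it is a toy illustration under our own naming, not the implementation of Sennrich et al. [68].

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Toy BPE learner: words is a dict {word: frequency}."""
    # Represent every word as a tuple of symbols (initially single characters).
    vocab = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# Example: the merges learned from a tiny corpus.
print(learn_bpe_merges({"unreasonably": 3, "reason": 5, "unable": 2}, num_merges=10))
```

A segmenter would then apply the learned merges (or, equivalently, the induced subword vocabulary) greedily to new words.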
The conventional methods for word segmentation are language-independent. Remarkably, they work well for agglutinative languages, in which words are formed by concatenating morphemes, since they work only with the surface form of words in estimating subword units. However, they overlook the morphology of fusion languages, in which words are formed by blending several morphemes. As a result, they may lead to the loss of semantic or syntactic information contained in the word structure. Nevertheless, there are several variants to the purely statistics-based conventional word segmentation methods to make them morphology-aware [3, 40, 53, 56, 66]. These modifications, however, did not seem to improve the original methods for translation of a few low-resource language pairs [21, 56, 64, 75].
Another issue with using traditional segmentation methods in low-resource settings is determining the optimal vocabulary size, i.e., the degree of segmentation [22, 34, 69]. There are mixed results regarding the optimal vocabulary size when training subword NMT models. While Wu et al. [78] and Denkowski and Neubig [20] recommend a vocabulary size between 8,000 and 32,000, Cherry et al. [14] and Ding et al. [22] argue that such large vocabularies degrade the performance of the models, especially in low-data conditions. Thus, the size of the vocabulary needs to be tailored to the dataset, and we would need to train several models with different possible vocabulary sizes to obtain the best model. Since this trial training involves high computational costs, some techniques have been proposed to estimate the optimal vocabulary size. Salesky et al. [63] proposed a method that gradually introduces new BPE vocabulary online based on persistent validation loss. It starts with smaller, general subwords and adds larger, more specific units as training progresses. Xu et al. [79] proposed another efficient solution, VOLT (Vocabulary Learning via Optimal Transport), by applying the economics concept of marginal utility [65], where the benefit is text entropy and the cost is vocabulary size. On the one hand, increasing the vocabulary size reduces text entropy, which benefits model learning [8]. On the other hand, an extensive vocabulary leads to parameter explosion and data sparseness, which is detrimental to model learning [1]. Therefore, Xu et al. [79] formulated vocabulary construction as an optimization problem aimed at finding the vocabulary with the highest marginal utility.
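One simplified way to express this trade-off (our notation, not necessarily the exact objective optimized by VOLT) is as the marginal utility of enlarging the vocabulary from size $k$ to $k + m$:

\[
\mathrm{MU}(k \rightarrow k+m) = \frac{-\left(H_{k+m} - H_{k}\right)}{m},
\]

where $H_{k}$ denotes the entropy of the training text when it is segmented with a vocabulary of size $k$; the search then favors the vocabulary at which this quantity is largest.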

3 Morpheme-Based Word Segmentation

In fusion languages, phonological and orthographic processes modify the boundaries of the actual morphemes. The most straightforward way to restore the altered morphemes is to rely on a morphological analyzer or a morphology treebank of the language. The following subsections discuss the morphology and morpheme-based segmentation of Amharic, a predominantly fusional language, along with English.

3.1 Morphology

English has a relatively simple fusional morphology. For example, in the word unreasonably, the morphemes are un, reason, able, and ly. Thus, the subword ably blends two morphemes: able and ly; to obtain the actual morphemes, we need to restore the letters le. Amharic, on the other hand, has a rich morphology. In Amharic, a space-delimited word is a blend of several morphemes. It may function as a word (e.g., // meaning “human”); a phrase (e.g., // meaning “from her house”); a clause (e.g., // meaning “the one who came”); or even a sentence (e.g., // meaning “She did not eat.”).
Amharic is dominated by fusional morphology; the boundaries of morphemes are unclear in many words. Like other Semitic languages, Amharic word formation rides on root-and-pattern morphology. Root-and-pattern morphology is non-agglutinative because the two morphemes that make up the word, the root and pattern, are interlaced instead of concatenated [27]. For example, the Amharic verbs // “he/it will break” and // “he/it will be broken” have a prefix and a suffix to indicate tense and aspect. When removing the affixes from both words, the stems // and // remain; they are composed of two parts, the root consisting of the consonant sequence //, and the pattern consisting of a template of vowels. In the first word, the pattern consists of the vowel // between the first and second consonant and no vowel between the second and third consonant, i.e., //; in the second word, the pattern consists of the same vowel in both positions, i.e., //.

3.2 Word Segmentation

We devised a morphological word segmentation method, MorphoSeg, based solely on a language’s morphological analyzer or treebank. It segments actual morphemes from words by restoring morphemes that phonological and orthographic processes have altered.
Universal Morphology (UniMorph) [5] does provide a morpheme segmentation database for English, but most entries are only shallowly segmented for our purposes. For instance, the adjective unaccountable is not segmented at all, even though we would expect the segmentation un-account-able. Therefore, we used a morphology treebank manually curated by Cotterell et al. [17] as a seed for English morpheme-based word segmentation. The treebank consists of about 7,000 word types. To increase its coverage, we extracted all sentences from the monolingual News Crawl corpus. First, we lemmatize each word in each sentence using the WordNet Lemmatizer and Part-of-Speech (PoS) Tagger in the Natural Language Toolkit (NLTK) [10]. Then, we check whether the lemma is in the treebank and has a further segmentation. If so, the word is segmented accordingly; otherwise, the word is segmented based on its lemma and the remaining subwords. Owing to the relatively simple morphology of English, most of the remaining subwords are either prefixes or suffixes. For example, the noun achievements has the lemma achievement; therefore, the initial segmentation is achievement-s; yet achievement is segmented in the treebank as achieve-ment, so the final segmentation is achieve-ment-s. Eventually, we created a morpheme segmentation database for nearly 42,000 word types along with their PoS. We have provided the English morpheme segmentation database at https://github.com/andmek/EngSegTable.
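A rough sketch of this database-building step is given below. It is illustrative only: the seed treebank entries, the helper names, and the suffix handling are our own simplifications, not the code behind the repository above; the NLTK calls themselves are standard.

```python
import nltk
from nltk.stem import WordNetLemmatizer

# Requires the NLTK data packages punkt, averaged_perceptron_tagger, and wordnet.
lemmatizer = WordNetLemmatizer()

def wordnet_pos(tag):
    """Map a Penn Treebank PoS tag to the WordNet PoS expected by the lemmatizer."""
    return {"J": "a", "V": "v", "R": "r"}.get(tag[0], "n")

# Toy seed treebank in the spirit of Cotterell et al. [17]: lemma -> morphemes.
treebank = {"achievement": ["achieve", "ment"], "unaccountable": ["un", "account", "able"]}

def add_entry(database, word, tag):
    """Derive a segmentation for one corpus word and store it under (word, PoS)."""
    word = word.lower()
    lemma = lemmatizer.lemmatize(word, wordnet_pos(tag))
    if lemma in treebank:
        # The remaining subword after the lemma is typically a suffix in English.
        suffix = word[len(lemma):] if word.startswith(lemma) else ""
        database[(word, tag)] = treebank[lemma] + ([suffix] if suffix else [])

database = {}
for word, tag in nltk.pos_tag(nltk.word_tokenize("Their achievements were unaccountable.")):
    add_entry(database, word, tag)
print(database)  # e.g., {('achievements', 'NNS'): ['achieve', 'ment', 's'], ...}
```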
Using the morpheme segmentation database, MorphoSeg segments the words of an English sentence one after another. It uses NLTK’s PoS tagger to determine the PoS of each word. If a word type together with its PoS is found in the lookup table, the stored segmentation replaces the word; otherwise, the word is simply returned, on the assumption that it cannot be segmented. For example, it segments “She acts unreasonably and without knowledge.” by finding each word, its PoS, and its segmentation in the lookup table, resulting in “She act-s un-reason-able-ly and without knowledge”.
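A matching sketch of how such a lookup table might be applied to a sentence follows; the function name and the hard-coded lookup entries are ours and serve only to illustrate the lookup behavior.

```python
import nltk

def morphoseg_en(sentence, lookup):
    """Segment an English sentence with a (word, PoS) -> morphemes lookup table."""
    segmented = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
        key = (word.lower(), tag)
        # Use the stored segmentation if present; otherwise return the word unchanged.
        segmented.append("-".join(lookup.get(key, [word])))
    return " ".join(segmented)

lookup = {
    ("acts", "VBZ"): ["act", "s"],
    ("unreasonably", "RB"): ["un", "reason", "able", "ly"],
}
print(morphoseg_en("She acts unreasonably and without knowledge.", lookup))
# Expected (if the tagger assigns VBZ and RB): She act-s un-reason-able-ly and without knowledge .
```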
We used a morphological analyzer and generator, HornMorpho [30], for Amharic morpheme-based word segmentation. HornMorpho is a rule-based system for morphological analysis and generation. It forms a cascade of composite finite-state transducers that implement a lexicon of roots and morphemes, as well as alternation rules that govern phonological or orthographic changes at morpheme boundaries [6]. Prior to version 2.5, HornMorpho analyzed only nouns and verbs; since Amharic adjectives behave like nouns, it also did not distinguish between adjectives and nouns, and it could not handle compound words or light verb constructions. Therefore, we helped the author to modify HornMorpho. The improved version distinguishes more parts of speech, such as verbs, nouns, adjectives, adverbs, and conjunctions. It has a larger lexicon than before, and it performs morphological analysis for constructions such as light verbs and compound words. Batsuren et al. [5] also used it in the UniMorph 4.0 project to generate the Amharic inflectional data.
To compile a morpheme segmentation database, we extracted all distinct words from the CACO corpus, ran HornMorpho’s analyzer on them, and removed the grammatical features from its output. For example, HornMorpho analyzes // as (subject = 3rd person singular masculine)--(infinitive)-(object = 1st person)-(auxiliary); when the grammatical features in parentheses are removed, it becomes ---- //. When HornMorpho provided multiple analyses of a word, we took the first analysis; we did not disambiguate the PoS of words in a sentence because HornMorpho does not have such a feature. Eventually, we created a morpheme segmentation database for approximately 840,000 word types. We have provided the morpheme segmentation database at https://github.com/andmek/AmhSegTable.
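The post-processing of the analyzer output can be sketched as follows. The `analyze` argument stands for whatever call returns analysis strings for a word; it is a hypothetical wrapper, since we do not reproduce HornMorpho’s actual API here, and the example analysis string uses abstract placeholders rather than real Amharic morphemes.

```python
import re

def strip_features(analysis):
    """Remove parenthesized grammatical features from a HornMorpho-style analysis string.

    For example, 'm1-(feature A)-m2-(feature B)-m3' becomes ['m1', 'm2', 'm3'].
    """
    without_features = re.sub(r"\([^)]*\)", "", analysis)
    return [m for m in without_features.split("-") if m]

def build_amharic_database(words, analyze):
    """Map each word type to the morphemes of its first analysis.

    `analyze` is a hypothetical wrapper returning a list of analysis strings for a word.
    """
    database = {}
    for word in words:
        analyses = analyze(word)
        if analyses:
            database[word] = strip_features(analyses[0])  # take the first analysis
    return database
```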
Using the morpheme segmentation database, MorphoSeg sequentially segments the words in an Amharic sentence. It first fills a lookup table with the database’s entries. If a word type is present in the lookup table, the stored segmentation is used in its place; otherwise, the word is assumed to be unsegmentable and is simply returned. For example, to segment “”, meaning “It is better for me to live with my mother.”, it first transliterates the sentence as “”; then, it segments each word by searching the lookup table for its segmentation, outputting the segmented, transliterated sentence.

4 Experiments and Evaluation

Using a baseline Transformer-based encoder-decoder system, we trained and evaluated several subword models. We conducted the experiments in two scenarios. In the first scenario, we applied Word-Piece to the raw (untransliterated) Amharic text to segment words into subwords, with a vocabulary size of 32,000 following Wu et al. [78]. In the second scenario, we applied different word segmentation methods, namely BPE, Morfessor, MorphoSeg, SPULM, and Word-Piece, to transliterated Amharic text. In both scenarios, we used the same dataset and followed the same preprocessing, training, and evaluation steps.

4.1 Baseline System

The encoder-decoder network is the de facto architecture for NMT. An NMT system can implement the encoder and decoder with recurrent neural networks or Transformers [77]. Transformer-based models attain the highest performance in both high- and low-resource conditions [2, 50, 69]. Thus, we used the Transformer-based encoder-decoder architecture to train the NMT models. It uses the Adam optimizer [43] with a learning rate that varies throughout training, a dropout [72] rate of 0.1, a label smoothing [74] value of 0.1, and a batch size of 1,024. We used the Tensor2Tensor [76] library to implement the system. Because the library supports only a joint subword vocabulary, we used the joint vocabulary of both the source and target languages.
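A sketch of how these hyperparameters could be set on top of Tensor2Tensor’s standard Transformer configuration is shown below. This is our own rendering under the assumption that the hparams names match Tensor2Tensor’s transformer_base; the problem definition and training loop are not shown.

```python
from tensor2tensor.models import transformer

hparams = transformer.transformer_base()    # standard Transformer encoder-decoder settings
hparams.batch_size = 1024                   # measured in source/target tokens, not sentences
hparams.layer_prepostprocess_dropout = 0.1  # dropout rate
hparams.label_smoothing = 0.1               # label smoothing value
# The Adam optimizer and the warmup-then-decay learning-rate schedule are library defaults.
```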

4.2 Amharic Transliteration

Transliteration, or character mapping, facilitates vocabulary sharing between languages, especially for loan words and named entities [19, 35]. Since we used a joint subword vocabulary for Amharic and English, we examined the orthography of Amharic and developed a rule-based transliteration method, Amharic Transliteration for Machine Translation (AT4MT). It maps Amharic characters to their phonemic representations in Latin-based characters. We detailed the method in Gezmu et al. [31] and provided its implementation at https://github.com/andmek/AT4MT.
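As a toy illustration of the idea, a rule-based transliterator is essentially a character-to-string lookup; the mappings below are illustrative only and do not reproduce the actual AT4MT table or its context-dependent rules, which are published at the repository above.

```python
# Illustrative fragment of a mapping from Ethiopic characters to Latin phonemic strings.
MAPPING = {
    "ሀ": "ha",
    "ለ": "le",
    "መ": "me",
}

def transliterate(text, mapping=MAPPING):
    """Replace every mapped character; pass unmapped characters (spaces, digits, Latin) through."""
    return "".join(mapping.get(ch, ch) for ch in text)
```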

4.3 Dataset and Preprocessing

We trained our models on the benchmark dataset of the Amharic-English parallel corpus [32]. Table 2 shows the number of sentence (segment) pairs in the dataset.
Table 2. The Number of Sentence (Segment) Pairs in the Amharic-English Parallel Data

Dataset          Number of Sentence Pairs
Test Set         2,500
Validation Set   2,864
Training Set     140,000
We preprocessed the dataset with the standard Moses toolkit [46] to prepare it for machine translation training. We tokenized the English data with Moses’ tokenizer script and modified the script to tokenize the Amharic data. We used BPE, Morfessor, MorphoSeg, SPULM, and Word-Piece to segment the words in the datasets for the subword models. We used the BPE implementation of Sennrich et al. [68], the Morfessor 2.0 implementation of Smit et al. [71], the SPULM implementation in the SentencePiece library [49], and the Word-Piece implementation in the Tensor2Tensor library [76]. Since SentencePiece operates on raw text, we did not tokenize the text for SPULM.
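For instance, an SPULM model can be trained and applied roughly as follows with the SentencePiece Python bindings; this is a sketch, the file names are placeholders, and the vocabulary size shown is just one of the values we searched over.

```python
import sentencepiece as spm

# Train a unigram-LM (SPULM) model directly on raw, untokenized text.
spm.SentencePieceTrainer.train(
    input="train.txt",        # placeholder path to the raw training text
    model_prefix="spulm8k",
    vocab_size=8000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="spulm8k.model")
print(sp.encode("She acts unreasonably and without knowledge.", out_type=str))
```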

4.4 Training and Decoding

Training NMT models is usually non-deterministic [58], and there is no convergence guarantee. Most research in NMT does not specify any stopping criteria; some studies mention only an approximate number of days spent training the models [4] or the exact number of training steps [77]. We therefore trained each NMT model for 250,000 steps, following the default in Tensor2Tensor. For decoding, we used a single model obtained by averaging the last 12 checkpoints. Following Wu et al. [78] and Vaswani et al. [77], we used beam search with a beam size of 4 and a length penalty of 0.6.
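For reference, the length penalty of Wu et al. [78] rescales a candidate translation’s log-probability as

\[
lp(Y) = \frac{(5 + |Y|)^{\alpha}}{(5 + 1)^{\alpha}}, \qquad
s(X, Y) = \frac{\log P(Y \mid X)}{lp(Y)},
\]

where $|Y|$ is the length of the candidate $Y$ and $\alpha$ is the length-penalty weight, 0.6 in our setting.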
Because the vocabulary sizes in BPE, Word-Piece, and SPULM affect the performance of the NMT models [22, 34, 69], we trained several models with different vocabulary sizes. We also used VOLT [79] to estimate the optimal vocabulary sizes to confirm the results. Eventually, we selected the best BPE, Word-Piece, and SPULM subword models and compared them with the Morfessor and MorphoSeg subword models.

4.5 Evaluation

Running a human evaluation (expert judgment) can be time-consuming and expensive. In practice, it can be used to compare a small number of variant systems. On the other hand, automated metrics are prevalent because they can rapidly evaluate several systems. Among the proper uses for automated evaluation metrics is comparing systems (models) that apply similar translation methods [12, 61]. Therefore, in this article, we focused on the objective evaluation of the NMT models with automated metrics.
Most automated metrics fall into two groups: metrics based on string overlap and metrics based on embedding similarity. COMET [60] is an embedding-based metric and the best-performing of the widely used metrics [44]. Because of its strong correlation with human judgment [28, 44], we primarily rely on it. We also used a string-overlap metric, BLEU [57], because of its popularity [54].
We desegmented and detokenized the translation outputs. Since COMET supports Amharic, we computed it after we “de-romanized” the Amharic text back into the Amharic (Ethiopic) script. We used the most recent and recommended model, wmt22-comet-da, with default parameters in version 2.0 of its implementation. It scales the scores between 0 and 1: a score close to 0 denotes a poor translation, and a score near 1 denotes a good translation. Thus, the new COMET model makes the scores more interpretable than the old models. Moreover, for uniformity with BLEU scores, we multiplied the COMET scores by 100 so that they fall between 0 and 100. In addition, for consistency, we used the sacreBLEU [59] implementation of BLEU.
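Scoring with the two metrics can be sketched as follows; file handling is omitted, the toy hypothesis, reference, and source strings are placeholders, and the ×100 scaling is the reporting convention described above.

```python
from sacrebleu.metrics import BLEU
from comet import download_model, load_from_checkpoint

hyps = ["Can we really live forever?"]
refs = ["Can we really live forever?"]
srcs = ["<source sentence>"]

# BLEU via sacreBLEU (default 13a tokenization, as in the reported signature).
bleu = BLEU()
print(bleu.corpus_score(hyps, [refs]).score)

# COMET 2.0 with the wmt22-comet-da model; scores are scaled to 0-100 for reporting.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score * 100)
```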
Although largely missing from the bulk of machine translation research, statistical hypothesis testing enables us to assess whether the differences between models are statistically significant [54]. We could apply a suitable test from the family of parametric tests, such as a paired z-test or t-test, if the distribution of the population were known and the observations (examples) in the sample were independent. These assumptions, however, do not hold for machine translation [24]. As a result, we used a non-parametric test, paired Bootstrap Resampling [25, 45], with COMET.
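A minimal sketch of paired bootstrap resampling over sentence-level metric scores is given below; it is our own rendering of the standard procedure [45], not the exact script we used, and the variable names in the commented call are hypothetical.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Return the p-value that system A is not better than system B.

    scores_a and scores_b are sentence-level metric scores for the same test sentences.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample test sentences with replacement
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return 1.0 - wins / n_resamples  # fraction of resamples where A does not beat B

# Example usage (hypothetical score lists): p < 0.05 would indicate a significant difference.
# p = paired_bootstrap(comet_scores_morphoseg, comet_scores_bpe)
```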

5 Results and Discussions

Table 3 shows pairwise comparisons of raw and transliterated Word-Piece subword models using the Bootstrap Resampling statistical significance test with COMET, taking 0.05 as the threshold for p-values. Two models are significantly different if the p-value is less than 0.05, which is indicated by an asterisk. Table 3 reveals that the transliterated Word-Piece subword models outperformed the raw (untransliterated) Word-Piece subword models. Therefore, in the following experiments, we used the transliterated Amharic text.
Table 3. Pairwise Comparisons of Raw and Transliterated Word-Piece Subword Models

Direction            Model                        BLEU   COMET
Amharic-to-English   Word-Piece-Transliterated    31.5   79.6
                     Word-Piece-Raw               31.5   79.2 (p = .006)*
English-to-Amharic   Word-Piece-Transliterated    22.0   85.8
                     Word-Piece-Raw               21.8   85.5 (p = .026)*
Table 4 presents the results of the conventional subword models pairwise compared with the MorphoSeg model. After choosing the best subword models for BPE, Word-Piece, and SPULM, we made comparisons. For BPE, with trial training, the optimal model’s vocabulary size ranges from 2,000 to 16,000 when it was trained on joint parallel data. VOLT also suggested that 9,000 is an optimal size for BPE. For SPULM, the optimal model’s vocabulary size ranges from 4,000 to 16,000; likewise, VOLT estimated it to be 7,000. For Word-Piece, the optimal model’s vocabulary size ranges from 1,000 to 16,000, but we could not estimate it with VOLT as VOLT does not support Word-Piece. Appendix A details the results of the trial training. Additionally, we provide sample translations in Appendix B.
Table 4. Pairwise Comparisons of MorphoSeg with Conventional Subword Models

Direction            Model        BLEU   COMET
Amharic-to-English   MorphoSeg    34.0   81.6
                     BPE          33.2   81.0 (p < .001)*
                     Morfessor    32.7   80.6 (p < .001)*
                     SPULM        33.4   81.1 (p = .005)*
                     Word-Piece   32.8   80.7 (p < .001)*
English-to-Amharic   MorphoSeg    26.4   86.9
                     BPE          26.6   86.6 (p = .020)*
                     Morfessor    26.4   86.2 (p < .001)*
                     SPULM        25.9   86.5 (p = .003)*
                     Word-Piece   26.1   86.4 (p = .001)*
According to Table 4, the MorphoSeg subword models obtained the best scores; hence, MorphoSeg outperforms the other methods in both translation directions. Note that when applying MorphoSeg to the Amharic dataset, we did not disambiguate the PoS of words in a sentence, since the Amharic morphological analyzer HornMorpho does not have such a feature. The segmentation of a word varies with its PoS, as words take on a different PoS depending on the context. If proper disambiguation had been made, we would expect even larger differences.

6 Conclusions and Future Work

We addressed the limitations of the conventional word segmentation methods often employed for NMT. We investigated the applicability of these methods to Amharic-English translation, a typical case involving a fusion language. We also devised a morpheme-based word segmentation method, MorphoSeg, as a remedy that restores morphemes altered by phonological or orthographic changes at morpheme boundaries. MorphoSeg is a compelling word segmentation method that depends solely on a language’s morphological analyzer or treebank. In addition, we compared conventional and morpheme-based NMT subword models. To this end, we implemented a baseline Transformer-based architecture. For the training of subword models, we used different word segmentation methods, namely BPE, Morfessor, MorphoSeg, SPULM, and Word-Piece, to segment words into subwords. Since the vocabulary sizes in BPE, Word-Piece, and SPULM impact the performance of the NMT models, we trained several models with different vocabulary sizes; we also used VOLT to estimate the vocabulary sizes to confirm the results. Eventually, we ran statistical significance tests with the COMET metric to compare conventional and morpheme-based NMT subword models. The morpheme-based models outperformed the conventional subword models in an evaluation study on a benchmark dataset, the Amharic-English parallel corpus.
Looking ahead, we propose incorporating linguistic knowledge into NMT models as future work. For example, the UniMorph 4.0 project [5] recently provided morphological inflection tables containing morphological features for 182 diverse languages, and it offered morpheme segmentations for 16 languages. In addition, it is important to investigate the efficacy of other morphological segmentation tools, such as MorphAGram [26], in low-resource NMT of fusion languages.
When applying MorphoSeg to the Amharic dataset, we did not disambiguate the PoS of words in a sentence since the Amharic morphological analyzer HornMorpho does not have such a feature. However, since the segmentation of a word varies with its PoS and words take on different PoS depending on the context, we strongly recommend the inclusion of PoS disambiguation for future research. Also, because of the scarcity of resources, we made only one run for each configuration. Since the random seeds induce noise in the models, we urge several replications to be run across different random seeds.

Acknowledgments

We want to thank Michael Gasser for his great enthusiasm and help while modifying the Amharic analyzer, HornMorpho. Our thanks also go to the anonymous reviewers because their comments have helped us to improve the paper significantly.

Footnotes

2. The vocabulary size of the corpus is approximately 870,000.
3. The corpus was provided at the Third Conference on Machine Translation and is available at http://data.statmt.org/wmt18/translation-task
4. The modified version of HornMorpho is available at https://github.com/hltdi/HornMorpho
5. The batch size is given in terms of the number of source and target language tokens.
14. sacreBLEU signature: nrefs:1, case:mixed, eff:no, tok:13a, smooth:exp, version:2.3.1
Appendices

A Results of Trial Training

Since the vocabulary sizes of the subword models are important for Byte Pair Encoding (BPE), Word-Piece, and Sentence Piece with Unigram Language Modeling (SPULM), we trained several models with different vocabulary sizes. Furthermore, Sennrich et al. [68] claim that, for languages that share alphabets, learning BPE on the joint source and target languages’ text increases the consistency of segmentation. Since we transliterated the Amharic dataset, we also considered the joint data training of BPE as an additional factor for model variation. Table 5 shows the performance results of the BPE subword models with vocabulary sizes from 1,000 (1K) to 16,000 (16K), measured with BLEU and COMET, taking the subword model with the highest COMET and BLEU scores as the baseline.
Table 5. Performance Results of BPE Subword Models with Different Vocabulary Sizes, Both Separate and Joint Data Training of BPE

Direction            Model           BLEU   COMET
Amharic-to-English   BPE-Joint-2K    33.2   81.0
                     BPE-1K          32.8   80.7 (p = .085)
                     BPE-2K          33.3   80.8 (p = .364)
                     BPE-4K          33.3   80.4 (p = .001)*
                     BPE-8K          33.3   80.4 (p < .001)*
                     BPE-16K         32.9   79.9 (p < .001)*
                     BPE-Joint-1K    32.2   80.8 (p = .220)
                     BPE-Joint-4K    32.9   80.6 (p = .089)
                     BPE-Joint-8K    33.3   80.5 (p = .004)*
                     BPE-Joint-16K   33.3   80.4 (p < .001)*
English-to-Amharic   BPE-4K          26.6   86.6
                     BPE-1K          26.0   86.7 (p = .660)
                     BPE-2K          26.4   86.6 (p = .692)
                     BPE-8K          26.4   86.3 (p = .047)*
                     BPE-16K         26.1   85.5 (p < .001)*
                     BPE-Joint-1K    24.6   86.0 (p < .001)*
                     BPE-Joint-2K    25.6   86.2 (p = .004)*
                     BPE-Joint-4K    26.4   86.4 (p = .149)
                     BPE-Joint-8K    26.6   86.5 (p = .412)
                     BPE-Joint-16K   26.6   86.2 (p = .004)*
In Table 5, the optimal vocabulary size ranges from 2K to 16K when BPE was trained on joint training data. VOLT (Vocabulary Learning via Optimal Transport) [79] also suggests that 9K is an optimal size. We further empirically analyzed the effect of separate and joint data training of BPE. While we could not see significant differences among the separately trained BPE subword models in Table 5, there were differences of up to one BLEU point among the jointly trained models in the Amharic-to-English translation and two BLEU points in the English-to-Amharic translation; the other metric, COMET, indicates similar results. Table 6 shows the performance results of Word-Piece subword NMT models with vocabulary sizes ranging from 1,000 (1K) to 32,000 (32K). We obtained optimum results when the vocabulary sizes were between 1K and 16K, but we could not estimate the optimal size with VOLT, as VOLT does not support Word-Piece. The differences in vocabulary sizes induce differences of up to 0.8 and 1.2 BLEU points in the Amharic-to-English and English-to-Amharic translations, respectively.
Table 6. Performance Results of Word-Piece Subword Models with Different Vocabulary Sizes

Direction            Model            BLEU   COMET
Amharic-to-English   Word-Piece-4K    32.8   80.7
                     Word-Piece-1K    32.2   80.4 (p = .045)*
                     Word-Piece-2K    32.2   80.3 (p = .016)*
                     Word-Piece-8K    33.0   80.1 (p < .001)*
                     Word-Piece-16K   32.9   80.0 (p < .001)*
                     Word-Piece-32K   32.2   79.6 (p < .001)*
English-to-Amharic   Word-Piece-4K    26.1   86.4
                     Word-Piece-1K    25.5   86.5 (p = .687)
                     Word-Piece-2K    25.7   86.3 (p = .471)
                     Word-Piece-8K    26.4   86.2 (p = .136)
                     Word-Piece-16K   26.7   86.1 (p = .038)*
                     Word-Piece-32K   26.7   85.8 (p < .001)*
Table 7 shows the performance results of SPULM subword NMT models with vocabulary sizes ranging from 1,000 (1K) to 32,000 (32K). We obtained optimum results when the vocabulary sizes were between 4K and 16K. VOLT also suggests that 7K is an optimal size.
Table 7. Performance Results of SPULM Models with Different Vocabulary Sizes

Direction            Model       BLEU   COMET
Amharic-to-English   SPULM-4K    33.4   81.1
                     SPULM-1K    31.9   80.5 (p < .001)*
                     SPULM-2K    32.3   80.8 (p = .064)
                     SPULM-8K    33.4   81.0 (p = .427)
                     SPULM-16K   33.3   81.0 (p = .590)
                     SPULM-32K   33.1   80.7 (p = .016)*
English-to-Amharic   SPULM-8K    25.9   86.5
                     SPULM-1K    24.5   85.8 (p < .001)*
                     SPULM-2K    25.5   86.2 (p = .011)*
                     SPULM-4K    26.0   86.4 (p = .490)
                     SPULM-16K   26.2   86.2 (p = .026)*
                     SPULM-32K   25.8   85.6 (p < .001)*

B Sample Translation Outputs

The following samples show the translation of Amharic sentences into English using different subword NMT models. The samples are sorted from short to long sentences.
Source:
Transliteration:
Reference: Can we really live forever?
BPE: Can we really live forever?
Morfessor: Will we really live forever?
MorphoSeg: Can we really live forever?
SPULM: Can we really live forever?
Word-Piece: Can we really live forever?
Source:
Transliteration:
Reference: Sandra quickly discovered that she had been scammed.
BPE: Sandra immediately recognized that she had been deceived.
Morfessor: Sandra saw that she had been removed.
MorphoSeg: Sandra immediately realized that she had been deceived.
SPULM: Sandra immediately realized that she had been abandoned.
Word-Piece: Sandra immediately recognized that she was mistaken.
Source:
Transliteration:
Reference: About that time, my parents asked me to come back home.
BPE: About that time, my parents asked me to return home.
Morfessor: About that time, my parents asked me to return home.
MorphoSeg: About that time, my parents asked me to go home.
SPULM: About that time, my parents asked me to go home.
Word-Piece: About that time, my parents asked me to return home.
Source:
Transliteration:
Reference: Six years later, the whole world economy collapsed.
BPE: Six years later, the whole world economic window came to an end.
Morfessor: Six years later, the global economy collapsed.
MorphoSeg: Six years later, the whole world’s economic developments have been interrupted.
SPULM: Six years later, the entire world economy has been destroyed.
Word-Piece: Six years later, the global economy sank into the world.
Source:
Transliteration:
Reference: Distressing circumstances can have a terrible impact on us.
BPE: Distressing circumstances can cause us feelings of anxiety.
Morfessor: Anxiety can cause us emotional trauma.
MorphoSeg: Distressing events can cause us pain.
SPULM: Stress can cause us emotional pain.
Word-Piece: Distressing situations can cause anxiety.
Source:
Transliteration:
Reference: Olive oil is used copiously, as it is produced there on a large scale.
BPE: Olive oil is achieved in the abundance of attack.
Morfessor: Olive oil is widely used for a high level.
MorphoSeg: Olive oils are widely used, and there is widespread use.
SPULM: Olive oil is highly guided by the product of sophistication.
Word-Piece: The olive oil is so extensive that it pushes on the abundant possible.
Source:
Transliteration:
Reference: Many doctors who visited the booth agreed that there is a need for blood conservation in surgical practice.
BPE: Many visitors have agreed that practicing surgery is vital to blood transfusion.
Morfessor: his part, dozens of doctors enjoyed the importance of keeping a brief period of blood polluted.
MorphoSeg: a number of doctors who visited became gifted at a high risk of flowing blood vessels.
SPULM: Many visitors have agreed that having a lot of surgery during the surgery is vital.
Word-Piece: Many physicians have found that it is too important to prevent blood loss during medical treatment.
Source:
Transliteration:
Reference: And like all of us, the blind take careful note of tone, which can convey a variety of emotions.
BPE: Like any human, they discern the sound and sense of enlightenment that can help us to pass on various types of blindness.
Morfessor: Like anyone, the blind notice the tone of people who can understand how to react to different ways.
MorphoSeg: Like everyone, they notice the sound of the tone of voice of the people.
SPULM: Like anyone, blind people discern the concept of an eye to convey variety of emotions.
Word-Piece: Like everyone, people’s voice and tongues introduce various kinds of emotions.
Source:
Transliteration:
Reference: It’s to be recalled that the five kidnappers after having obliged the airplane that was having a local flight on April 26, 2001 landed it in Khartoum, released the people on board and gave themselves up to the Sudan government.
BPE: It is to be recalled that five enemies gave their hands in Sudan by disturbing the air force plane that was in the country on April 18, 2001 and after scheduling the Khartoum passed on by.
Morfessor: It is to be recalled that the five kidnappers used to hand the passengers down to Sudan government by giving the heads of the passengers who were in their arms on April 18, 2001, after threatening airport.
MorphoSeg: It is to be recalled that the five kidnappers left the air force in their country on April 18, 2001 to put the passengers behind bars and handed them over to Sudan.
SPULM: It is to be recalled that the five enemies released the air force airplane that was on its way to April 18, 2001 and released their hand to Sudanese government.
Word-Piece: The five enemys are to give their hand over to Sudan with out passengers’ hand after diverting the air force on May 18, 2001, the dispute was to be recalled.
Source:
Transliteration:
Reference: In response to so called rent riots in Chicago, Illinois, U.S.A., that occurred during the great depression of the 1930’s, city officials suspended evictions and arranged for some of the rioters to get work.
BPE: During the 1930’s, world economic downfall was raised in Chicko, U.S.A., with regards to rent accounts for local oppositions and some oppositions to function in the activities of the city.
Morfessor: In the 1930’s, the world grew up during a great depression in the nation of economic depression as a result of the Jerusalem crisis, contact with housebounds, enforced security officials from house to house, and oppositions stopped.
MorphoSeg: In the 1930’s, during the great depression in the world, an violence in the United States was formed in rebellion against houses, so the city authorities could stand up to ground troops to get jobs from their rent.
SPULM: During the 1930’s, an average of great economic breakthroughs in the world between China, U.S.A., U.S.A., and some staffs of the town’s authorities had influenced themselves to stop checking chores.
Word-Piece: During the 1930’s a great depression on the world’s economic depression, Missouri, U.S.A., U.S. home rebellion was issued, and some opposers promoted headquarters and opposed virtually.

References

[1]
Ben Allison, David Guthrie, and Louise Guthrie. 2006. Another look at the data sparsity problem. In Proceedings of the 9th International Conference on Text, Speech and Dialogue, TSD 2006, Brno, Czech Republic, September 11-15, 2006, Proceedings (Lecture Notes in Computer Science, Vol. 4188). Springer, 327–334.
[2]
Ali Araabi and Christof Monz. 2020. Optimizing Transformer for low-resource neural machine translation. In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020. International Committee on Computational Linguistics, 3429–3435.
[3]
Duygu Ataman, Matteo Negri, Marco Turchi, and Marcello Federico. 2017. Linguistically motivated vocabulary reduction for neural machine translation from Turkish to English. Prague Bull. Math. Linguistics 108 (2017), 331–342. http://ufal.mff.cuni.cz/pbml/108/art-ataman-negri-turchi-federico.pdf
[4]
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1409.0473
[5]
Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieras, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina J. Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Abbott Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, Juan López Bautista, Gema Celeste Silva Villegas, Lucas Torroba Hennigen, Adam Ek, David Guriel, Peter Dirix, Jean-Philippe Bernardy, Andrey Scherbakov, Aziyana Bayyr-ool, Antonios Anastasopoulos, Roberto Zariquiey, Karina Sheifer, Sofya Ganieva, Hilaria Cruz, Ritván Karahóga, Stella Markantonatou, George Pavlidis, Matvey Plugaryov, Elena Klyachko, Ali Salehi, Candy Angulo, Jatayu Baxi, Andrew Krizhanovsky, Natalia Krizhanovskaya, Elizabeth Salesky, Clara Vania, Sardana Ivanova, Jennifer White, Rowan Hall Maudslay, Josef Valvoda, Ran Zmigrod, Paula Czarnowska, Irene Nikkarinen, Aelita Salchak, Brijesh Bhatt, Christopher Straughn, Zoey Liu, Jonathan North Washington, Yuval Pinter, Duygu Ataman, Marcin Wolinski, Totok Suhardijanto, Anna Yablonskaya, Niklas Stoehr, Hossep Dolatian, Zahroh Nuriah, Shyam Ratan, Francis M. Tyers, Edoardo M. Ponti, Grant Aiton, Aryaman Arora, Richard J. Hatcher, Ritesh Kumar, Jeremiah Young, Daria Rodionova, Anastasia Yemelina, Taras Andrushko, Igor Marchenko, Polina Mashkovtseva, Alexandra Serova, Emily Prud’hommeaux, Maria Nepomniashchaya, Fausto Giunchiglia, Eleanor Chodroff, Mans Hulden, Miikka Silfverberg, Arya D. McCarthy, David Yarowsky, Ryan Cotterell, Reut Tsarfaty, and Ekaterina Vylomova. 2022. UniMorph 4.0: Universal morphology. In Proceedings of the 13th Language Resources and Evaluation Conference, LREC 2022, Marseille, France, 20-25 June 2022. European Language Resources Association, 840–855. https://aclanthology.org/2022.lrec-1.89
[6]
Kenneth R. Beesley and Lauri Karttunen. 2003. Finite-state morphology: Xerox tools and techniques. CSLI, Stanford (2003).
[7]
Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. 2016. Neural versus phrase-based machine translation quality: A case study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1–4, 2016. The Association for Computational Linguistics, 257–267.
[8]
Christian Bentz and Dimitrios Alikaniotis. 2016. The word entropy of natural languages. CoRR abs/1606.06996 (2016). arXiv:1606.06996. http://arxiv.org/abs/1606.06996
[9]
Christian Bentz, Dimitrios Alikaniotis, Michael Cysouw, and Ramon Ferrer-i-Cancho. 2017. The entropy of words - learnability and expressivity across more than 1000 languages. Entropy 19, 6 (2017), 275.
[10]
Steven Bird. 2006. NLTK: The natural language toolkit. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006. The Association for Computer Linguistics.
[11]
Kaj Bostrom and Greg Durrett. 2020. Byte pair encoding is suboptimal for language model pretraining. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020 (Findings of ACL, Vol. EMNLP 2020). Association for Computational Linguistics, 4617–4624.
[12]
Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the role of Bleu in machine translation research. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2006, April 3-7, 2006, Trento, Italy. The Association for Computer Linguistics. https://aclanthology.org/E06-1032/
[13]
Sheila Castilho, Joss Moorkens, Federico Gaspari, Iacer Calixto, John Tinsley, and Andy Way. 2017. Is neural machine translation the new state of the art? Prague Bull. Math. Linguistics 108 (2017), 109–120. http://ufal.mff.cuni.cz/pbml/108/art-castilho-moorkens-gaspari-tinsley-calixto-way.pdf
[14]
Colin Cherry, George F. Foster, Ankur Bapna, Orhan Firat, and Wolfgang Macherey. 2018. Revisiting character-based neural machine translation with capacity and compression. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018. Association for Computational Linguistics, 4295–4305. https://aclanthology.org/D18-1461/
[15]
Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014. Association for Computational Linguistics, 103–111.
[16]
Marta R. Costa-jussà and José A. R. Fonollosa. 2016. Character-based neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 2: Short Papers. The Association for Computer Linguistics.
[17]
Ryan Cotterell, Arun Kumar, and Hinrich Schütze. 2016. Morphological segmentation inside-out. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016. The Association for Computational Linguistics, 2325–2330.
[18]
Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Trans. Speech Lang. Process. 4, 1 (2007), 3:1–3:34.
[19]
Raj Dabre, Anoop Kunchukuttan, Atsushi Fujita, and Eiichiro Sumita. 2018. NICT’s participation in WAT 2018: Approaches using multilingualism and recurrently stacked layers. In Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation, WAT@PACLIC 2018, Hong Kong, December 1-3, 2018. Association for Computational Linguistics. https://aclanthology.org/Y18-3003/
[20]
Michael J. Denkowski and Graham Neubig. 2017. Stronger baselines for trustable results in neural machine translation. In Proceedings of the 1st Workshop on Neural Machine Translation, NMT@ACL 2017, Vancouver, Canada, August 4, 2017. Association for Computational Linguistics, 18–27.
[21]
Prajit Dhar, Arianna Bisazza, and Gertjan van Noord. 2020. Linguistically motivated subwords for English-Tamil translation: University of Groningen’s submission to WMT-2020. In Proceedings of the 5th Conference on Machine Translation, WMT@EMNLP 2020, Online, November 19-20, 2020. Association for Computational Linguistics, 126–133. https://aclanthology.org/2020.wmt-1.9/
[22]
Shuoyang Ding, Adithya Renduchintala, and Kevin Duh. 2019. A call for prudent choice of subword merge operations in neural machine translation. In Proceedings of Machine Translation Summit XVII Volume 1: Research Track, MTSummit 2019, Dublin, Ireland, August 19-23, 2019. European Association for Machine Translation, 204–213. https://aclanthology.org/W19-6620/
[23]
Bonnie J. Dorr, Pamela W. Jordan, and John W. Benoit. 1999. A survey of current paradigms in machine translation. Adv. Comput. 49 (1999), 1–68.
[24]
Rotem Dror, Lotem Peled-Cohen, Segev Shlomov, and Roi Reichart. 2020. Statistical Significance Testing for Natural Language Processing. Morgan & Claypool Publishers.
[25]
Bradley Efron and Robert Tibshirani. 1993. An Introduction to the Bootstrap. Springer.
[26]
Ramy Eskander, Francesca Callejas, Elizabeth Nichols, Judith Klavans, and Smaranda Muresan. 2020. MorphAGram, evaluation and framework for unsupervised morphological segmentation. In Proceedings of the 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020. European Language Resources Association, 7112–7122. https://aclanthology.org/2020.lrec-1.879/
[27]
Ray Fabri, Michael Gasser, Nizar Habash, George Kiraz, and Shuly Wintner. 2014. Linguistic introduction: The orthography, morphology and syntax of Semitic languages. In Natural Language Processing of Semitic Languages, Imed Zitouni (Ed.). Springer, 3–41.
[28]
Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, George F. Foster, Alon Lavie, and Ondrej Bojar. 2021. Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain. In Proceedings of the 6th Conference on Machine Translation, WMT@EMNLP 2021, Online Event, November 10-11, 2021. Association for Computational Linguistics, 733–774. https://aclanthology.org/2021.wmt-1.73
[29]
Philip Gage. 1994. A new algorithm for data compression. C Users Journal 12, 2 (1994), 23–38.
[30]
Michael Gasser. 2011. HornMorpho: A system for morphological processing of Amharic, Oromo, and Tigrinya. In Conference on Human Language Technology for Development, Alexandria, Egypt.
[31]
Andargachew Mekonnen Gezmu, Andreas Nürnberger, and Tesfaye Bayu Bati. 2021. Neural machine translation for Amharic-English translation. In Proceedings of the 13th International Conference on Agents and Artificial Intelligence, ICAART 2021, Volume 1, Online Streaming, February 4-6, 2021. SCITEPRESS, 526–532.
[32]
Andargachew Mekonnen Gezmu, Andreas Nürnberger, and Tesfaye Bayu Bati. 2022. Extended parallel corpus for Amharic-English machine translation. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022). European Language Resources Association, Marseille, France, 6644–6653. http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.716.pdf
[33]
Andargachew Mekonnen Gezmu, Binyam Ephrem Seyoum, Michael Gasser, and Andreas Nürnberger. 2018. Contemporary Amharic corpus: Automatically morpho-syntactically tagged Amharic corpus. In Proceedings of the 1st Workshop on Linguistic Resources for Natural Language Processing. Association for Computational Linguistics, Santa Fe, New Mexico, USA, 65–70. https://aclanthology.org/W18-3809
[34]
Thamme Gowda and Jonathan May. 2020. Finding the optimal vocabulary size for neural machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020 (Findings of ACL, Vol. EMNLP 2020). Association for Computational Linguistics, 3955–3964.
[35]
Vikrant Goyal, Sourav Kumar, and Dipti Misra Sharma. 2020. Efficient neural machine translation for low-resource languages via exploiting related languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, ACL 2020, Online, July 5-10, 2020. Association for Computational Linguistics, 162–168.
[36]
Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O. K. Li. 2018. Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers). Association for Computational Linguistics, 344–354.
[37]
Ximena Gutierrez-Vasques, Christian Bentz, Olga Sozinova, and Tanja Samardzic. 2021. From characters to words: The turning point of BPE merges. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19-23, 2021. Association for Computational Linguistics, 3454–3468.
[38]
Barry Haddow, Rachel Bawden, Antonio Valerio Miceli Barone, Jindrich Helcl, and Alexandra Birch. 2022. Survey of low-resource machine translation. Comput. Linguistics 48, 3 (2022), 673–732.
[39]
Martin Haspelmath. 2007. Pre-established categories don’t exist: Consequences for language description and typology. Linguistic Typology 11, 1 (2007), 119–132.
[40]
Matthias Huck, Simon Riess, and Alexander M. Fraser. 2017. Target-side word segmentation strategies for neural machine translation. In Proceedings of the 2nd Conference on Machine Translation, WMT 2017, Copenhagen, Denmark, September 7-8, 2017. Association for Computational Linguistics, 56–67.
[41]
Sébastien Jean, KyungHyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers. The Association for Computer Linguistics, 1–10.
[42]
Daniel Jurafsky and James H. Martin. 2009. Speech and Language Processing (2nd Edition). Prentice-Hall, Inc., USA.
[43]
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1412.6980
[44]
Tom Kocmi, Christian Federmann, Roman Grundkiewicz, Marcin Junczys-Dowmunt, Hitokazu Matsushita, and Arul Menezes. 2021. To ship or not to ship: An extensive evaluation of automatic metrics for machine translation. In Proceedings of the 6th Conference on Machine Translation, WMT@EMNLP 2021, Online Event, November 10-11, 2021. Association for Computational Linguistics, 478–494. https://aclanthology.org/2021.wmt-1.57
[45]
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, EMNLP 2004, A meeting of SIGDAT, a Special Interest Group of the ACL, held in conjunction with ACL 2004, 25-26 July 2004, Barcelona, Spain. ACL, 388–395. https://aclanthology.org/W04-3250/
[46]
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, ACL 2007, June 23-30, 2007, Prague, Czech Republic. The Association for Computational Linguistics. https://aclanthology.org/P07-2045/
[47]
Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the 1st Workshop on Neural Machine Translation, NMT@ACL 2017, Vancouver, Canada, August 4, 2017. Association for Computational Linguistics, 28–39.
[48]
Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers. Association for Computational Linguistics, 66–75.
[49]
Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018. Association for Computational Linguistics, 66–71.
[50]
Séamus Lankford, Haithem Afli, and Andy Way. 2021. Transformers for low-resource languages: Is Féidir linn!. In Proceedings of the 18th Biennial Machine Translation Summit - Volume 1: Research Track, MTSummit 2021 Virtual, August 16-20, 2021. Association for Machine Translation in the Americas, 48–60. https://aclanthology.org/2021.mtsummit-research.5
[51]
Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2017. Fully character-level neural machine translation without explicit segmentation. Trans. Assoc. Comput. Linguistics 5 (2017), 365–378.
[52]
Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers. The Association for Computer Linguistics, 11–19.
[53]
Dominik Machácek, Jonás Vidra, and Ondrej Bojar. 2018. Morphological and language-agnostic word segmentation for NMT. In Text, Speech, and Dialogue - 21st International Conference, TSD 2018, Brno, Czech Republic, September 11-14, 2018, Proceedings (Lecture Notes in Computer Science, Vol. 11107). Springer, 277–284.
[54]
Benjamin Marie, Atsushi Fujita, and Raphael Rubino. 2021. Scientific credibility of machine translation research: A meta-evaluation of 769 papers. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021. Association for Computational Linguistics, 7297–7306.
[55]
Sabrina J. Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y. Lee, Benoît Sagot, and Samson Tan. 2021. Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP. CoRR abs/2112.10508 (2021). arXiv:2112.10508. https://arxiv.org/abs/2112.10508
[56]
John E. Ortega, Richard Castro Mamani, and Kyunghyun Cho. 2020. Neural machine translation with a polysynthetic low resource language. Mach. Transl. 34, 4 (2020), 325–346.
[57]
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA. ACL, 311–318.
[58]
Martin Popel and Ondrej Bojar. 2018. Training tips for the Transformer model. Prague Bull. Math. Linguistics 110 (2018), 43–70. http://ufal.mff.cuni.cz/pbml/110/art-popel-bojar.pdf
[59]
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the 3rd Conference on Machine Translation: Research Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018. Association for Computational Linguistics, 186–191.
[60]
Ricardo Rei, Craig Stewart, Ana C. Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020. Association for Computational Linguistics, 2685–2702.
[61]
Ehud Reiter. 2018. A structured review of the validity of BLEU. Comput. Linguistics 44, 3 (2018).
[62]
Jorma Rissanen. 1998. Stochastic Complexity in Statistical Inquiry. World Scientific Series in Computer Science, Vol. 15. World Scientific.
[63]
Elizabeth Salesky, Andrew Runge, Alex Coda, Jan Niehues, and Graham Neubig. 2020. Optimizing segmentation granularity for neural machine translation. Mach. Transl. 34, 1 (2020), 41–59.
[64]
Jonne Sälevä and Constantine Lignos. 2021. The effectiveness of morphology-aware segmentation in low-resource neural machine translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, EACL 2021, Online, April 19-23, 2021. Association for Computational Linguistics, 164–174.
[65]
Paul A. Samuelson. 1937. A note on measurement of utility. The Review of Economic Studies 4, 2 (1937), 155–161.
[66]
Víctor M. Sánchez-Cartagena, Juan Antonio Pérez-Ortiz, and Felipe Sánchez-Martínez. 2019. The Universitat d’Alacant submissions to the English-to-Kazakh news translation task at WMT 2019. In Proceedings of the 4th Conference on Machine Translation, WMT 2019, Florence, Italy, August 1-2, 2019 - Volume 2: Shared Task Papers, Day 1. Association for Computational Linguistics, 356–363.
[67]
Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2012, Kyoto, Japan, March 25-30, 2012. IEEE, 5149–5152.
[68]
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.
[69]
Rico Sennrich and Biao Zhang. 2019. Revisiting low-resource neural machine translation: A case study. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers. Association for Computational Linguistics, 211–221.
[70]
Claude E. Shannon. 1948. A mathematical theory of communication. Bell Syst. Tech. J. 27, 3 (1948), 379–423.
[71]
Peter Smit, Sami Virpioja, Stig-Arne Grönroos, and Mikko Kurimo. 2014. Morfessor 2.0: Toolkit for statistical morphological segmentation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2014, April 26-30, 2014, Gothenburg, Sweden. The Association for Computer Linguistics, 21–24.
[72]
Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1 (2014), 1929–1958. http://dl.acm.org/citation.cfm?id=2670313
[73]
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13, 2014, Montreal, Quebec, Canada. 3104–3112. https://proceedings.neurips.cc/paper/2014/hash/a14ac55a4f27472c5d894ec1c3c743d2-Abstract.html
[74]
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society, 2818–2826.
[75]
Antonio Toral, Lukas Edman, Galiya Yeshmagambetova, and Jennifer Spenader. 2019. Neural machine translation for English-Kazakh with morphological segmentation and synthetic data. In Proceedings of the 4th Conference on Machine Translation, WMT 2019, Florence, Italy, August 1-2, 2019 - Volume 2: Shared Task Papers, Day 1. Association for Computational Linguistics, 386–392.
[76]
Ashish Vaswani, Samy Bengio, Eugene Brevdo, François Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Lukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2Tensor for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas, AMTA 2018, Boston, MA, USA, March 17-21, 2018 - Volume 1: Research Papers. Association for Machine Translation in the Americas, 193–199. https://aclanthology.org/W18-1819/
[77]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. 5998–6008. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
[78]
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144 (2016). arXiv:1609.08144. http://arxiv.org/abs/1609.08144
[79]
Jingjing Xu, Hao Zhou, Chun Gan, Zaixiang Zheng, and Lei Li. 2021. Vocabulary learning via optimal transport for neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021. Association for Computational Linguistics, 7361–7373.
[80]
Janis Zuters, Gus Strazds, and Karlis Immers. 2018. Semi-automatic quasi-morphological word segmentation for neural machine translation. In Databases and Information Systems - 13th International Baltic Conference, DB&IS 2018, Trakai, Lithuania, July 1-4, 2018, Proceedings (Communications in Computer and Information Science, Vol. 838). Springer, 289–301.

    Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 9
September 2023
226 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3625383

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 September 2023
    Online AM: 28 July 2023
    Accepted: 14 July 2023
    Revised: 16 May 2023
    Received: 05 July 2022
    Published in TALLIP Volume 22, Issue 9

    Author Tags

    1. Neural machine translation
    2. morpheme-based word segmentation
    3. fusion languages
    4. low-resource languages
    5. transformers

    Qualifiers

    • Note
