
Morpheme-Based Neural Machine Translation Models for Low-Resource Fusion Languages

Published: 22 September 2023

Abstract

Neural approaches, which are currently state-of-the-art in many areas, have contributed significantly to the exciting advancements in machine translation. However, Neural Machine Translation (NMT) requires a substantial quantity of good-quality parallel training data to train the best models. A large amount of training data, in turn, increases the underlying vocabulary exponentially. Therefore, several methods have been devised to work with a relatively limited vocabulary, owing to constraints of computing resources such as system memory. Encoding words as sequences of subword units for so-called open-vocabulary translation is an effective strategy for solving this problem. However, the conventional methods for splitting words into subwords rely on statistics-based approaches that mainly suit agglutinative languages, in which morphemes have relatively clean boundaries. Their applicability to fusion languages, the main focus of this article, still needs to be thoroughly investigated. In fusion languages, phonological and orthographic processes alter the borders of the constituent morphemes of a word. This makes it difficult to distinguish the actual morphemes that carry syntactic or semantic information from the word’s surface form, the form of the word as it appears in the text. We, therefore, resorted to a word segmentation method that segments words by restoring the altered morphemes. We also compared conventional and morpheme-based NMT subword models and show that morpheme-based models outperform conventional subword models on a benchmark dataset.

1 Introduction

Machine translation is challenging because of language differences such as morphological variations [23]. Categorizing languages for cross-linguistic comparison is also difficult [39]. One way to make such a comparison is by assessing the dimensions of morphological typology. According to Jurafsky and Martin [42], morphological typology can vary in two dimensions: the first dimension ranges from isolating to polysynthetic, and the second dimension ranges from agglutinative to fusional. The first dimension relates to the number of morphemes per word. In isolating morphology, words typically consist of only one morpheme, while in polysynthetic morphology, words have multiple morphemes. The second dimension has to do with how segmentable morphemes are. It encompasses morpheme boundaries that are generally clear in agglutinative morphology and morpheme boundaries that are hazy in fusional morphology. The dimensions can be exemplified by Vietnamese (isolating), Siberian Yupik (polysynthetic), Turkish (agglutinative), and Amharic (fusional).
Different approaches have been used up to this point to automate the intricate task of translation. The initial attempts used rule-based systems to translate text from a source language into a target language. However, developing rule-based systems is time-consuming and costly because it is difficult to codify all the essential linguistic knowledge for accurate translation with hand-crafted rules. Such systems also have limited scalability and require considerable linguistic knowledge and resources that might not be available for low-resource languages [38]. Therefore, alternative data-driven strategies emerged as parallel corpora became more widely accessible. These methods benefit from the accurate translations produced by human translators: they rely on machine learning over curated parallel training data, or parallel corpora, to create translation models.
The two most well-known data-driven approaches are Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). NMT has surpassed SMT in recent years [4, 69, 73, 77]. Its success is attributed to several characteristics. First, unlike SMT, all components of an NMT system can be jointly tuned to optimize translation performance. Second, it processes complete sentences rather than just words or n-grams as SMT does. Third, it handles syntactic and semantic differences between languages better than SMT [7, 13]. Finally, it produces more fluent translations than SMT [47]. Nevertheless, the amount and quality of parallel training data significantly affect the performance of NMT models [36]. Very few of the approximately 7,000 languages spoken today have training data of the amount and quality required for NMT.
Additionally, because NMT only works with a fixed vocabulary due to the constraints of computing resources such as the GPU’s dedicated memory, it has trouble handling rare and out-of-vocabulary words in texts [68]. The issue is exacerbated when a word has multiple morphemes, as in synthetic languages such as Amharic and Turkish. In these languages, a single word may have thousands of different inflections, and a language’s lexicon may number in the hundreds of thousands or millions. For instance, in Amharic, the official language of Ethiopia, the word // meaning “to be” has roughly five thousand inflections in the 22 million tokens of the Contemporary Amharic Corpus (CACO) [33]. In Amharic, a space-delimited word may represent a phrase, clause, or sentence. For example, the word // meaning “until he explains it to them” is a clause. This word does not appear even once in the CACO corpus. Nevertheless, its constituent morphemes — //-//-//-//-// — appear several times in the corpus as parts of other words. Hence, the vocabulary of the language is too large for NMT.
Several methods have been proposed to work with a relatively small vocabulary because NMT is computationally resource-intensive and requires vast quantities of parallel training data. Segmenting words into sequences of subword units for so-called open-vocabulary translation is one effective way to address this issue [68]. In this way, the models can be trained on all words, which makes effective use of the limited training data. Furthermore, because some of the unknown words are only variations of words already included in the training data, the method also somewhat alleviates the issue of out-of-vocabulary words. However, the most conventional word segmentation methods commonly used in NMT [55], such as Byte Pair Encoding (BPE) [68], Word-Piece [67, 78], and Sentence Piece with Unigram Language Modeling (SPULM) [48], conform to agglutinative morphology. They depend on statistics-based methods for splitting words into subword units. The boundaries between morphemes, or meaningful word components, are generally clear in agglutinative morphology. The suitability of these methods for fusional morphology needs to be examined. The borders of constituent morphemes in fusion languages are altered by phonological and orthographic processes, making it difficult to separate the actual morphemes that carry syntactic or semantic information from written words, or surface forms. For instance, in the word unreasonably, the morphemes un and reason are simply concatenated, or agglutinated. The subword ably, however, fuses two morphemes, able and ly; it thus exhibits fusional morphology. Table 1 demonstrates the segmentation of unreasonably using different segmentation methods. All methods either under-segment or over-segment it. Therefore, we resorted to a morphological word segmentation method that segments words by restoring the actual morphemes. We also compared conventional and morpheme-based NMT subword models in an evaluation study on a benchmark dataset, the Amharic-English parallel corpus.
Table 1. Segmentation of the English Word Unreasonably Using Different Methods

Method       Segmentation
BPE          un-reason-ably
SPULM        un-re-as-on-ab-ly
Word-Piece   un-reas-ona-bl-y
The remainder of this article is organized as follows. Section 2 discusses related work. Section 3 explains the proposed word segmentation method. To apply and evaluate the method, we developed a baseline NMT system; Section 4 compares conventional and morpheme-based NMT subword models trained with it. Section 5 reports the results of the evaluation study. Section 6 gives conclusions and directions for future research.

2 Related Work

Although translation is an open-vocabulary problem, NMT models operate with a fixed vocabulary due to the limitations of computational resources. During the training of NMT models, the most frequent words, commonly between 30,000 and 80,000, are included in the vocabulary [4, 73]. Both words unseen at training time and less frequent (rare) words thus become out-of-vocabulary words. In practical NMT model training, a single special token represents them. This technique works well when there are only a few out-of-vocabulary words. However, translation performance degrades rapidly as the number of out-of-vocabulary words increases [4, 15]. The problem worsens for languages dominated by synthetic morphology, either agglutinative or fusional. These languages can have hundreds of thousands, if not millions, of words in their vocabulary, most of which become out-of-vocabulary words. The worst-case scenario is small training data for synthetic low-resource languages, which yields many out-of-vocabulary words during inference.
Another approach to the translation of out-of-vocabulary words is a back-off to a dictionary lookup [41, 52]. Nonetheless, this approach requires supplementary resources such as bilingual lexicons, which may not be readily available for low-resource languages. It also makes assumptions that do not always hold in reality, such as a one-to-one correspondence between source and target language words [68].
A feasible solution to make an NMT model capable of open-vocabulary translation is segmenting words as sequences of subword units [48, 67, 68, 78, 80]. The extreme case can be a character-level segmentation [16, 51]. However, compared with higher-level subwords, translating characters results in longer sequences, which is challenging for both modeling and computation [55].
In subword NMT models, most conventional word segmentation methods follow statistics-based approaches that use a data compression method [29] to reduce text entropy [70], an idea derived from information theory. We can view a text as a sequence of symbols (i.e., words or subwords), where each symbol is generated with a certain probability and carries a certain information content [9]. The higher the probability of a symbol, the lower its information content [37]. According to Mielke et al. [55], the most conventional word segmentation methods commonly used in NMT are BPE [68], Word-Piece [67, 78], and SPULM [48]. BPE iteratively replaces the most frequent pair of characters in a sequence with a single, unused character. A subword learner first decomposes the entire training text into single characters. Then, it induces a vocabulary by iteratively merging the most frequent adjacent pairs of characters or subwords until the desired subword vocabulary size is reached. Once the subword vocabulary is learned, a segmenter splits words in a text by greedily matching the longest available subword type. Word-Piece is analogous to BPE. However, while BPE uses co-occurrence frequency to select potential merges of subwords, Word-Piece relies on the likelihood of an n-gram language model trained on a version of the training text that contains the merged subwords. SPULM, on the other hand, is a fully probabilistic method based on a unigram language model. Unlike BPE or Word-Piece, SPULM builds the vocabulary using a top-down approach. It starts with a vast initial vocabulary containing all characters and the most frequent subword candidates in the training text. Then, it iteratively removes the subwords whose removal least degrades the overall probability. It is similar to Morfessor’s unsupervised segmentation [18], apart from Morfessor’s informed prior over subword length [11, 62].
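To make the BPE merge-learning loop concrete, the following is a minimal sketch of the procedure described above; it is a toy illustration under our own naming, not the implementation of Sennrich et al. [68].

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Toy BPE learner: words is a dict {word: frequency}."""
    # Represent every word as a tuple of symbols (initially single characters).
    vocab = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# Example: the merges learned from a tiny corpus.
print(learn_bpe_merges({"unreasonably": 3, "reason": 5, "unable": 2}, num_merges=10))
```

A segmenter would then apply the learned merges (or, equivalently, the induced subword vocabulary) greedily to new words.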
The conventional methods for word segmentation are language-independent. Remarkably, they work well for agglutinative languages, in which words are formed by concatenating morphemes, since they work only with the surface form of words in estimating subword units. However, they overlook the morphology of fusion languages, in which words are formed by blending several morphemes. As a result, they may lead to the loss of semantic or syntactic information contained in the word structure. Nevertheless, there are several variants to the purely statistics-based conventional word segmentation methods to make them morphology-aware [3, 40, 53, 56, 66]. These modifications, however, did not seem to improve the original methods for translation of a few low-resource language pairs [21, 56, 64, 75].
Another issue with using traditional segmentation methods in low-resource settings is determining the optimal vocabulary size, i.e., the degree of segmentation [22, 34, 69]. There are mixed results regarding the optimal vocabulary size when training subword NMT models. While Wu et al. [78] and Denkowski and Neubig [20] recommend a vocabulary size between 8,000 and 32,000, Cherry et al. [14] and Ding et al. [22] argue that such large vocabularies degrade the performance of the models, especially in low-data conditions. Thus, the size of the vocabulary needs to be tailored to the dataset, and we would need to train several models with different possible vocabulary sizes to obtain the best model. Since this trial training involves high computational costs, some techniques have been proposed to estimate the optimal vocabulary size. Salesky et al. [63] proposed a method that gradually introduces new BPE vocabulary online based on persistent validation loss. It starts with smaller, general subwords and adds larger, more specific units as training progresses. Xu et al. [79] proposed another efficient solution, VOLT (Vocabulary Learning via Optimal Transport), by applying the economics concept of marginal utility [65], where the benefit is text entropy and the cost is vocabulary size. On the one hand, increasing the vocabulary size reduces text entropy, which benefits model learning [8]. On the other hand, an extensive vocabulary leads to parameter explosion and data sparseness, which is detrimental to model learning [1]. Therefore, Xu et al. [79] formulated vocabulary construction as an optimization problem aimed at finding the vocabulary with the highest marginal utility.
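One simplified way to express this trade-off (our notation, not necessarily the exact objective optimized by VOLT) is as the marginal utility of enlarging the vocabulary from size $k$ to $k + m$:

\[
\mathrm{MU}(k \rightarrow k+m) = \frac{-\left(H_{k+m} - H_{k}\right)}{m},
\]

where $H_{k}$ denotes the entropy of the training text when it is segmented with a vocabulary of size $k$; the search then favors the vocabulary at which this quantity is largest.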

3 Morpheme-Based Word Segmentation

In fusion languages, phonological and orthographic processes modify the boundaries of the actual morphemes. The most straightforward way to restore the altered morphemes is to rely on a morphological analyzer or a morphology treebank of the language. The following subsections discuss the morphology and morpheme-based segmentation of Amharic, a predominantly fusional language, along with English.

3.1 Morphology

English has a relatively simple fusional morphology. For example, in the word unreasonably, the morphemes are un, reason, able, and ly. Thus, the subword ably blends two morphemes: able and ly; to obtain the actual morphemes, we need to restore the letters le. Amharic, on the other hand, has a rich morphology. In Amharic, a space-delimited word is a blend of several morphemes. It may function as a word (e.g., // meaning “human”); a phrase (e.g., // meaning “from her house”); a clause (e.g., // meaning “the one who came”); or even a sentence (e.g., // meaning “She did not eat.”).
Amharic is dominated by fusional morphology; the boundaries of morphemes are unclear in many words. Like other Semitic languages, Amharic word formation rides on root-and-pattern morphology. Root-and-pattern morphology is non-agglutinative because the two morphemes that make up the word, the root and pattern, are interlaced instead of concatenated [27]. For example, the Amharic verbs // “he/it will break” and // “he/it will be broken” have a prefix and a suffix to indicate tense and aspect. When removing the affixes from both words, the stems // and // remain; they are composed of two parts, the root consisting of the consonant sequence //, and the pattern consisting of a template of vowels. In the first word, the pattern consists of the vowel // between the first and second consonant and no vowel between the second and third consonant, i.e., //; in the second word, the pattern consists of the same vowel in both positions, i.e., //.

3.2 Word Segmentation

We devised a morphological word segmentation method, MorphoSeg, based solely on a language’s morphological analyzer or treebank. It segments actual morphemes from words by restoring morphemes that phonological and orthographic processes have altered.
Universal Morphology (UniMorph) [5] does provide a morpheme segmentation database for English, but most entries are only shallowly segmented for our purposes. For instance, the adjective unaccountable is not segmented at all, even though we would expect the segmentation un-account-able. Therefore, we used a morphology treebank manually curated by Cotterell et al. [17] as a seed for English morpheme-based word segmentation. The treebank consists of about 7,000 word types. To increase its coverage, we extracted all sentences from the monolingual News Crawl corpus. First, we lemmatize each word in each sentence using the WordNet Lemmatizer and Part-of-Speech (PoS) Tagger in the Natural Language Toolkit (NLTK) [10]. Then, we check whether the lemma is in the treebank and has a further segmentation. If so, the word is segmented accordingly; otherwise, the word is segmented based on its lemma and the remaining subwords. Owing to the relatively simple morphology of English, most of the remaining subwords are either prefixes or suffixes. For example, the noun achievements has the lemma achievement; therefore, the initial segmentation is achievement-s; yet achievement is segmented in the treebank as achieve-ment, so the final segmentation is achieve-ment-s. Eventually, we created a morpheme segmentation database for nearly 42,000 word types along with their PoS. We have provided the English morpheme segmentation database at https://github.com/andmek/EngSegTable.
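A rough sketch of this database-building step is given below. It is illustrative only: the seed treebank entries, the helper names, and the suffix handling are our own simplifications, not the code behind the repository above; the NLTK calls themselves are standard.

```python
import nltk
from nltk.stem import WordNetLemmatizer

# Requires the NLTK data packages punkt, averaged_perceptron_tagger, and wordnet.
lemmatizer = WordNetLemmatizer()

def wordnet_pos(tag):
    """Map a Penn Treebank PoS tag to the WordNet PoS expected by the lemmatizer."""
    return {"J": "a", "V": "v", "R": "r"}.get(tag[0], "n")

# Toy seed treebank in the spirit of Cotterell et al. [17]: lemma -> morphemes.
treebank = {"achievement": ["achieve", "ment"], "unaccountable": ["un", "account", "able"]}

def add_entry(database, word, tag):
    """Derive a segmentation for one corpus word and store it under (word, PoS)."""
    word = word.lower()
    lemma = lemmatizer.lemmatize(word, wordnet_pos(tag))
    if lemma in treebank:
        # The remaining subword after the lemma is typically a suffix in English.
        suffix = word[len(lemma):] if word.startswith(lemma) else ""
        database[(word, tag)] = treebank[lemma] + ([suffix] if suffix else [])

database = {}
for word, tag in nltk.pos_tag(nltk.word_tokenize("Their achievements were unaccountable.")):
    add_entry(database, word, tag)
print(database)  # e.g., {('achievements', 'NNS'): ['achieve', 'ment', 's'], ...}
```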
Using the morpheme segmentation database, MorphoSeg segments the words of an English sentence one after another. It uses NLTK’s PoS tagger to determine the PoS of each word. If a word type together with its PoS is found in the lookup table, the stored segmentation replaces the word; otherwise, the word is simply returned, on the assumption that it cannot be segmented. For example, it segments “She acts unreasonably and without knowledge.” by finding each word, its PoS, and its segmentation in the lookup table, resulting in “She act-s un-reason-able-ly and without knowledge”.
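A matching sketch of how such a lookup table might be applied to a sentence follows; the function name and the hard-coded lookup entries are ours and serve only to illustrate the lookup behavior.

```python
import nltk

def morphoseg_en(sentence, lookup):
    """Segment an English sentence with a (word, PoS) -> morphemes lookup table."""
    segmented = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
        key = (word.lower(), tag)
        # Use the stored segmentation if present; otherwise return the word unchanged.
        segmented.append("-".join(lookup.get(key, [word])))
    return " ".join(segmented)

lookup = {
    ("acts", "VBZ"): ["act", "s"],
    ("unreasonably", "RB"): ["un", "reason", "able", "ly"],
}
print(morphoseg_en("She acts unreasonably and without knowledge.", lookup))
# Expected (if the tagger assigns VBZ and RB): She act-s un-reason-able-ly and without knowledge .
```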
We used a morphological analyzer and generator, HornMorpho [30], for Amharic morpheme-based word segmentation. HornMorpho is a rule-based system for morphological analysis and generation. It forms a cascade of composite finite-state transducers that implement a lexicon of roots and morphemes, as well as alternation rules that govern phonological or orthographic changes at morpheme boundaries [6]. Prior to version 2.5, HornMorpho analyzed only nouns and verbs; since Amharic adjectives behave like nouns, it also did not distinguish between adjectives and nouns, and it could not handle compound words or light verb constructions. Therefore, we helped the author to modify HornMorpho. The improved version distinguishes more parts of speech, such as verbs, nouns, adjectives, adverbs, and conjunctions. It has a larger lexicon than before, and it performs morphological analysis for constructions such as light verbs and compound words. Batsuren et al. [5] also used it in the UniMorph 4.0 project to generate the Amharic inflectional data.
To compile a morpheme segmentation database, we extracted all distinct words from the CACO corpus, ran HornMorpho’s analyzer on them, and removed the grammatical features from its output. For example, HornMorpho analyzes // as (subject = 3rd person singular masculine)--(infinitive)-(object = 1st person)-(auxiliary); when the grammatical features in parentheses are removed, it becomes ---- //. When HornMorpho provided multiple analyses of a word, we took the first analysis; we did not disambiguate the PoS of words in a sentence because HornMorpho does not have such a feature. Eventually, we created a morpheme segmentation database for approximately 840,000 word types. We have provided the morpheme segmentation database at https://github.com/andmek/AmhSegTable.
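The post-processing of the analyzer output can be sketched as follows. The `analyze` argument stands for whatever call returns analysis strings for a word; it is a hypothetical wrapper, since we do not reproduce HornMorpho’s actual API here, and the example analysis string uses abstract placeholders rather than real Amharic morphemes.

```python
import re

def strip_features(analysis):
    """Remove parenthesized grammatical features from a HornMorpho-style analysis string.

    For example, 'm1-(feature A)-m2-(feature B)-m3' becomes ['m1', 'm2', 'm3'].
    """
    without_features = re.sub(r"\([^)]*\)", "", analysis)
    return [m for m in without_features.split("-") if m]

def build_amharic_database(words, analyze):
    """Map each word type to the morphemes of its first analysis.

    `analyze` is a hypothetical wrapper returning a list of analysis strings for a word.
    """
    database = {}
    for word in words:
        analyses = analyze(word)
        if analyses:
            database[word] = strip_features(analyses[0])  # take the first analysis
    return database
```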
Using the morpheme segmentation database, MorphoSeg sequentially segments the words in an Amharic sentence. It first fills a lookup table with the database’s entries. If a word type is present in the lookup table, the stored segmentation is used in its place; otherwise, the word is assumed to be unsegmentable and is simply returned. For example, to segment “”, meaning “It is better for me to live with my mother.”, it first transliterates the sentence as “”; then, it segments each word by searching the lookup table for its segmentation, outputting the segmented, transliterated sentence.

4 Experiments and Evaluation

Using a baseline Transformer-based encoder-decoder system, we trained and evaluated several subword models. We conducted the experiments in two scenarios. In the first scenario, we applied Word-Piece to the raw (untransliterated) Amharic text to segment words into subwords, with a vocabulary size of 32,000 following Wu et al. [78]. In the second scenario, we applied different word segmentation methods, namely BPE, Morfessor, MorphoSeg, SPULM, and Word-Piece, to transliterated Amharic text. In both scenarios, we used the same dataset and followed the same preprocessing, training, and evaluation steps.

4.1 Baseline System

The encoder-decoder network is the de facto architecture for NMT. An NMT system can implement the encoder and decoder with recurrent neural networks or Transformers [77]. Transformer-based models attain the highest performance in both high- and low-resource conditions [2, 50, 69]. Thus, we used the Transformer-based encoder-decoder architecture to train the NMT models. It uses the Adam optimizer [43] with a learning rate that varies throughout training, a dropout [72] rate of 0.1, a label smoothing [74] value of 0.1, and a batch size of 1,024. We used the Tensor2Tensor [76] library to implement the system. Because the library supports only a joint subword vocabulary, we used the joint vocabulary of both the source and target languages.
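A sketch of how these hyperparameters could be set on top of Tensor2Tensor’s standard Transformer configuration is shown below. This is our own rendering under the assumption that the hparams names match Tensor2Tensor’s transformer_base; the problem definition and training loop are not shown.

```python
from tensor2tensor.models import transformer

hparams = transformer.transformer_base()    # standard Transformer encoder-decoder settings
hparams.batch_size = 1024                   # measured in source/target tokens, not sentences
hparams.layer_prepostprocess_dropout = 0.1  # dropout rate
hparams.label_smoothing = 0.1               # label smoothing value
# The Adam optimizer and the warmup-then-decay learning-rate schedule are library defaults.
```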

4.2 Amharic Transliteration

Transliteration, or character mapping, facilitates vocabulary sharing between languages, especially for loan words and named entities [19, 35]. Since we used a joint subword vocabulary for Amharic and English, we examined the orthography of Amharic and developed a rule-based transliteration method, Amharic Transliteration for Machine Translation (AT4MT). It maps Amharic characters to their phonemic representations in Latin-based characters. We detailed the method in Gezmu et al. [31] and provided its implementation at https://github.com/andmek/AT4MT.
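As a toy illustration of the idea, a rule-based transliterator is essentially a character-to-string lookup; the mappings below are illustrative only and do not reproduce the actual AT4MT table or its context-dependent rules, which are published at the repository above.

```python
# Illustrative fragment of a mapping from Ethiopic characters to Latin phonemic strings.
MAPPING = {
    "ሀ": "ha",
    "ለ": "le",
    "መ": "me",
}

def transliterate(text, mapping=MAPPING):
    """Replace every mapped character; pass unmapped characters (spaces, digits, Latin) through."""
    return "".join(mapping.get(ch, ch) for ch in text)
```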

4.3 Dataset and Preprocessing

We trained our models on the benchmark dataset of the Amharic-English parallel corpus [32]. Table 2 shows the number of sentence (segment) pairs in the dataset.
Table 2. The Number of Sentence (Segment) Pairs in the Amharic-English Parallel Data

Dataset          Number of Sentence Pairs
Test Set         2,500
Validation Set   2,864
Training Set     140,000
We preprocessed the dataset with the standard Moses toolkit [46] to prepare it for machine translation training. We tokenized the English data with Moses’ tokenizer script and modified the script to tokenize the Amharic data. We used BPE, Morfessor, MorphoSeg, SPULM, and Word-Piece to segment the words in the datasets for the subword models. We used the BPE implementation of Sennrich et al. [68], the Morfessor 2.0 implementation of Smit et al. [71], the SPULM implementation in the SentencePiece library [49], and the Word-Piece implementation in the Tensor2Tensor library [76]. Since SentencePiece operates on raw text, we did not tokenize the text for SPULM.
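For instance, an SPULM model can be trained and applied roughly as follows with the SentencePiece Python bindings; this is a sketch, the file names are placeholders, and the vocabulary size shown is just one of the values we searched over.

```python
import sentencepiece as spm

# Train a unigram-LM (SPULM) model directly on raw, untokenized text.
spm.SentencePieceTrainer.train(
    input="train.txt",        # placeholder path to the raw training text
    model_prefix="spulm8k",
    vocab_size=8000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="spulm8k.model")
print(sp.encode("She acts unreasonably and without knowledge.", out_type=str))
```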

4.4 Training and Decoding

Training NMT models is usually non-deterministic [58], and there is no convergence guarantee. Most research in NMT does not specify any stopping criteria; some studies mention only an approximate number of days spent training the models [4] or the exact number of training steps [77]. We therefore trained each NMT model for 250,000 steps, following the default in Tensor2Tensor. For decoding, we used a single model obtained by averaging the last 12 checkpoints. Following Wu et al. [78] and Vaswani et al. [77], we used beam search with a beam size of 4 and a length penalty of 0.6.
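For reference, the length penalty of Wu et al. [78] rescales a candidate translation’s log-probability as

\[
lp(Y) = \frac{(5 + |Y|)^{\alpha}}{(5 + 1)^{\alpha}}, \qquad
s(X, Y) = \frac{\log P(Y \mid X)}{lp(Y)},
\]

where $|Y|$ is the length of the candidate $Y$ and $\alpha$ is the length-penalty weight, 0.6 in our setting.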
Because the vocabulary sizes in BPE, Word-Piece, and SPULM affect the performance of the NMT models [22, 34, 69], we trained several models with different vocabulary sizes. We also used VOLT [79] to estimate the optimal vocabulary sizes to confirm the results. Eventually, we selected the best BPE, Word-Piece, and SPULM subword models and compared them with the Morfessor and MorphoSeg subword models.

4.5 Evaluation

Running a human evaluation (expert judgment) can be time-consuming and expensive. In practice, it can be used to compare a small number of variant systems. On the other hand, automated metrics are prevalent because they can rapidly evaluate several systems. Among the proper uses for automated evaluation metrics is comparing systems (models) that apply similar translation methods [12, 61]. Therefore, in this article, we focused on the objective evaluation of the NMT models with automated metrics.
Most automated metrics fall into two groups: metrics based on string overlap and metrics based on embedding similarity. COMET [60] is an embedding-based metric and the best-performing of the widely used metrics [44]. Because of its strong correlation with human judgment [28, 44], we primarily rely on it. We also used a string-overlap metric, BLEU [57], because of its popularity [54].
We desegmented and detokenized the translation outputs. Since COMET supports Amharic, we computed it after we “de-romanized” the Amharic text back into the Amharic (Ethiopic) script. We used the most recent and recommended model, wmt22-comet-da, with default parameters in version 2.0 of its implementation. It scales the scores between 0 and 1: a score close to 0 denotes a poor translation, and a score near 1 denotes a good translation. Thus, the new COMET model makes the scores more interpretable than the old models. Moreover, for uniformity with BLEU scores, we multiplied the COMET scores by 100 so that they fall between 0 and 100. In addition, for consistency, we used the sacreBLEU [59] implementation of BLEU.
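Scoring with the two metrics can be sketched as follows; file handling is omitted, the toy hypothesis, reference, and source strings are placeholders, and the ×100 scaling is the reporting convention described above.

```python
from sacrebleu.metrics import BLEU
from comet import download_model, load_from_checkpoint

hyps = ["Can we really live forever?"]
refs = ["Can we really live forever?"]
srcs = ["<source sentence>"]

# BLEU via sacreBLEU (default 13a tokenization, as in the reported signature).
bleu = BLEU()
print(bleu.corpus_score(hyps, [refs]).score)

# COMET 2.0 with the wmt22-comet-da model; scores are scaled to 0-100 for reporting.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score * 100)
```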
Although largely missing from the bulk of machine translation research, statistical hypothesis testing enables us to assess whether the differences between models are statistically significant [54]. We could apply a suitable test from the family of parametric tests, such as a paired z-test or t-test, if the distribution of the population were known and the observations (examples) in the sample were independent. These assumptions, however, do not hold for machine translation [24]. As a result, we used a non-parametric test, paired Bootstrap Resampling [25, 45], with COMET.
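A minimal sketch of paired bootstrap resampling over sentence-level metric scores is given below; it is our own rendering of the standard procedure [45], not the exact script we used, and the variable names in the commented call are hypothetical.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Return the p-value that system A is not better than system B.

    scores_a and scores_b are sentence-level metric scores for the same test sentences.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample test sentences with replacement
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return 1.0 - wins / n_resamples  # fraction of resamples where A does not beat B

# Example usage (hypothetical score lists): p < 0.05 would indicate a significant difference.
# p = paired_bootstrap(comet_scores_morphoseg, comet_scores_bpe)
```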

5 Results and Discussions

Table 3 shows pairwise comparisons of raw and transliterated Word-Piece subword models using the Bootstrap Resampling statistical significance test with COMET, taking 0.05 as the threshold for p-values. Two models are significantly different if the p-value is less than 0.05, which is indicated by an asterisk. Table 3 reveals that the transliterated Word-Piece subword models outperformed the raw (untransliterated) Word-Piece subword models. Therefore, in the following experiments, we used the transliterated Amharic text.
Table 3. Pairwise Comparisons of Raw and Transliterated Word-Piece Subword Models

Direction            Model                        BLEU   COMET
Amharic-to-English   Word-Piece-Transliterated    31.5   79.6
                     Word-Piece-Raw               31.5   79.2 (p = .006)*
English-to-Amharic   Word-Piece-Transliterated    22.0   85.8
                     Word-Piece-Raw               21.8   85.5 (p = .026)*
Table 4 presents the results of the conventional subword models pairwise compared with the MorphoSeg model. After choosing the best subword models for BPE, Word-Piece, and SPULM, we made comparisons. For BPE, with trial training, the optimal model’s vocabulary size ranges from 2,000 to 16,000 when it was trained on joint parallel data. VOLT also suggested that 9,000 is an optimal size for BPE. For SPULM, the optimal model’s vocabulary size ranges from 4,000 to 16,000; likewise, VOLT estimated it to be 7,000. For Word-Piece, the optimal model’s vocabulary size ranges from 1,000 to 16,000, but we could not estimate it with VOLT as VOLT does not support Word-Piece. Appendix A details the results of the trial training. Additionally, we provide sample translations in Appendix B.
Table 4. Pairwise Comparisons of MorphoSeg with Conventional Subword Models

Direction            Model        BLEU   COMET
Amharic-to-English   MorphoSeg    34.0   81.6
                     BPE          33.2   81.0 (p < .001)*
                     Morfessor    32.7   80.6 (p < .001)*
                     SPULM        33.4   81.1 (p = .005)*
                     Word-Piece   32.8   80.7 (p < .001)*
English-to-Amharic   MorphoSeg    26.4   86.9
                     BPE          26.6   86.6 (p = .020)*
                     Morfessor    26.4   86.2 (p < .001)*
                     SPULM        25.9   86.5 (p = .003)*
                     Word-Piece   26.1   86.4 (p = .001)*
According to Table 4, the MorphoSeg subword models obtained the best scores; hence, MorphoSeg outperforms the other methods in both translation directions. Note that when applying MorphoSeg to the Amharic dataset, we did not disambiguate the PoS of words in a sentence, since the Amharic morphological analyzer HornMorpho does not have such a feature. The segmentation of a word varies with its PoS, as words take on a different PoS depending on the context. If proper disambiguation had been made, we would expect even larger differences.

6 Conclusions and Future Work

We addressed the limitations of the conventional word segmentation methods often employed for NMT. We investigated the applicability of these methods to Amharic-English translation, a typical case involving a fusion language. We also devised a morpheme-based word segmentation method, MorphoSeg, as a remedy that restores morphemes altered by phonological or orthographic changes at morpheme boundaries. MorphoSeg is a compelling word segmentation method that depends solely on a language’s morphological analyzer or treebank. In addition, we compared conventional and morpheme-based NMT subword models. To this end, we implemented a baseline Transformer-based architecture. For the training of subword models, we used different word segmentation methods, namely BPE, Morfessor, MorphoSeg, SPULM, and Word-Piece, to segment words into subwords. Since the vocabulary sizes in BPE, Word-Piece, and SPULM impact the performance of the NMT models, we trained several models with different vocabulary sizes; we also used VOLT to estimate the vocabulary sizes to confirm the results. Eventually, we ran statistical significance tests with the COMET metric to compare conventional and morpheme-based NMT subword models. The morpheme-based models outperformed the conventional subword models in an evaluation study on a benchmark dataset, the Amharic-English parallel corpus.
Looking ahead, we propose incorporating linguistic knowledge into NMT models as future work. For example, the UniMorph 4.0 project [5] recently provided morphological inflection tables containing morphological features for 182 diverse languages, and it offered morpheme segmentations for 16 languages. In addition, it is important to investigate the efficacy of other morphological segmentation tools, such as MorphAGram [26], in low-resource NMT of fusion languages.
When applying MorphoSeg to the Amharic dataset, we did not disambiguate the PoS of words in a sentence since the Amharic morphological analyzer HornMorpho does not have such a feature. However, since the segmentation of a word varies with its PoS and words take on different PoS depending on the context, we strongly recommend the inclusion of PoS disambiguation for future research. Also, because of the scarcity of resources, we made only one run for each configuration. Since the random seeds induce noise in the models, we urge several replications to be run across different random seeds.

Acknowledgments

We want to thank Michael Gasser for his great enthusiasm and help while modifying the Amharic analyzer, HornMorpho. Our thanks also go to the anonymous reviewers because their comments have helped us to improve the paper significantly.

Footnotes

2. The vocabulary size of the corpus is approximately 870,000.
3. The corpus was provided at the Third Conference on Machine Translation and is available at http://data.statmt.org/wmt18/translation-task
4. The modified version of HornMorpho is available at https://github.com/hltdi/HornMorpho
5. The batch size is given in terms of the number of source and target language tokens.
14. sacreBLEU signature: nrefs:1, case:mixed, eff:no, tok:13a, smooth:exp, version:2.3.1
Appendices

A Results of Trial Training

Since the vocabulary sizes of the subword models are important for Byte Pair Encoding (BPE), Word-Piece, and Sentence Piece with Unigram Language Modeling (SPULM), we trained several models with different vocabulary sizes. Furthermore, Sennrich et al. [68] claim that, for languages that share alphabets, learning BPE on the joint source and target languages’ text increases the consistency of segmentation. Since we transliterated the Amharic dataset, we also considered the joint data training of BPE as an additional factor for model variation. Table 5 shows the performance results of the BPE subword models with vocabulary sizes from 1,000 (1K) to 16,000 (16K), measured with BLEU and COMET, taking the subword model with the highest COMET and BLEU scores as the baseline.
Table 5. Performance Results of BPE Subword Models with Different Vocabulary Sizes, Both Separate and Joint Data Training of BPE

Direction            Model           BLEU   COMET
Amharic-to-English   BPE-Joint-2K    33.2   81.0
                     BPE-1K          32.8   80.7 (p = .085)
                     BPE-2K          33.3   80.8 (p = .364)
                     BPE-4K          33.3   80.4 (p = .001)*
                     BPE-8K          33.3   80.4 (p < .001)*
                     BPE-16K         32.9   79.9 (p < .001)*
                     BPE-Joint-1K    32.2   80.8 (p = .220)
                     BPE-Joint-4K    32.9   80.6 (p = .089)
                     BPE-Joint-8K    33.3   80.5 (p = .004)*
                     BPE-Joint-16K   33.3   80.4 (p < .001)*
English-to-Amharic   BPE-4K          26.6   86.6
                     BPE-1K          26.0   86.7 (p = .660)
                     BPE-2K          26.4   86.6 (p = .692)
                     BPE-8K          26.4   86.3 (p = .047)*
                     BPE-16K         26.1   85.5 (p < .001)*
                     BPE-Joint-1K    24.6   86.0 (p < .001)*
                     BPE-Joint-2K    25.6   86.2 (p = .004)*
                     BPE-Joint-4K    26.4   86.4 (p = .149)
                     BPE-Joint-8K    26.6   86.5 (p = .412)
                     BPE-Joint-16K   26.6   86.2 (p = .004)*
In Table 5, the optimal vocabulary size ranges from 2K to 16K when BPE was trained on joint training data. VOLT (Vocabulary Learning via Optimal Transport) [79] also suggests that 9K is an optimal size. We further empirically analyzed the effect of separate and joint data training of BPE. While we could not see significant differences among the separately trained BPE subword models in Table 5, there were differences of up to one BLEU point among the jointly trained models in the Amharic-to-English translation and two BLEU points in the English-to-Amharic translation; the other metric, COMET, indicates similar results. Table 6 shows the performance results of Word-Piece subword NMT models with vocabulary sizes ranging from 1,000 (1K) to 32,000 (32K). We obtained optimum results when the vocabulary sizes were between 1K and 16K, but we could not estimate the optimal size with VOLT, as VOLT does not support Word-Piece. The differences in vocabulary sizes induce differences of up to 0.8 and 1.2 BLEU points in the Amharic-to-English and English-to-Amharic translations, respectively.
Table 6. Performance Results of Word-Piece Subword Models with Different Vocabulary Sizes

Direction            Model            BLEU   COMET
Amharic-to-English   Word-Piece-4K    32.8   80.7
                     Word-Piece-1K    32.2   80.4 (p = .045)*
                     Word-Piece-2K    32.2   80.3 (p = .016)*
                     Word-Piece-8K    33.0   80.1 (p < .001)*
                     Word-Piece-16K   32.9   80.0 (p < .001)*
                     Word-Piece-32K   32.2   79.6 (p < .001)*
English-to-Amharic   Word-Piece-4K    26.1   86.4
                     Word-Piece-1K    25.5   86.5 (p = .687)
                     Word-Piece-2K    25.7   86.3 (p = .471)
                     Word-Piece-8K    26.4   86.2 (p = .136)
                     Word-Piece-16K   26.7   86.1 (p = .038)*
                     Word-Piece-32K   26.7   85.8 (p < .001)*
Table 7 shows the performance results of SPULM subword NMT models with vocabulary sizes ranging from 1,000 (1K) to 32,000 (32K). We obtained optimum results when the vocabulary sizes were between 4K and 16K. VOLT also suggests that 7K is an optimal size.
Table 7. Performance Results of SPULM Models with Different Vocabulary Sizes

Direction            Model       BLEU   COMET
Amharic-to-English   SPULM-4K    33.4   81.1
                     SPULM-1K    31.9   80.5 (p < .001)*
                     SPULM-2K    32.3   80.8 (p = .064)
                     SPULM-8K    33.4   81.0 (p = .427)
                     SPULM-16K   33.3   81.0 (p = .590)
                     SPULM-32K   33.1   80.7 (p = .016)*
English-to-Amharic   SPULM-8K    25.9   86.5
                     SPULM-1K    24.5   85.8 (p < .001)*
                     SPULM-2K    25.5   86.2 (p = .011)*
                     SPULM-4K    26.0   86.4 (p = .490)
                     SPULM-16K   26.2   86.2 (p = .026)*
                     SPULM-32K   25.8   85.6 (p < .001)*

B Sample Translation Outputs

The following samples show the translation of Amharic sentences into English using different subword NMT models. The samples are sorted from short to long sentences.
Source:
Transliteration:
Reference: Can we really live forever?
BPE: Can we really live forever?
Morfessor: Will we really live forever?
MorphoSeg: Can we really live forever?
SPULM: Can we really live forever?
Word-Piece: Can we really live forever?
Source:
Transliteration:
Reference: Sandra quickly discovered that she had been scammed.
BPE: Sandra immediately recognized that she had been deceived.
Morfessor: Sandra saw that she had been removed.
MorphoSeg: Sandra immediately realized that she had been deceived.
SPULM: Sandra immediately realized that she had been abandoned.
Word-Piece: Sandra immediately recognized that she was mistaken.
Source:
Transliteration:
Reference: About that time, my parents asked me to come back home.
BPE: About that time, my parents asked me to return home.
Morfessor: About that time, my parents asked me to return home.
MorphoSeg: About that time, my parents asked me to go home.
SPULM: About that time, my parents asked me to go home.
Word-Piece: About that time, my parents asked me to return home.
Source:
Transliteration:
Reference: Six years later, the whole world economy collapsed.
BPE: Six years later, the whole world economic window came to an end.
Morfessor: Six years later, the global economy collapsed.
MorphoSeg: Six years later, the whole world’s economic developments have been interrupted.
SPULM: Six years later, the entire world economy has been destroyed.
Word-Piece: Six years later, the global economy sank into the world.
Source:
Transliteration:
Reference: Distressing circumstances can have a terrible impact on us.
BPE: Distressing circumstances can cause us feelings of anxiety.
Morfessor: Anxiety can cause us emotional trauma.
MorphoSeg: Distressing events can cause us pain.
SPULM: Stress can cause us emotional pain.
Word-Piece: Distressing situations can cause anxiety.
Source:
Transliteration:
Reference: Olive oil is used copiously, as it is produced there on a large scale.
BPE: Olive oil is achieved in the abundance of attack.
Morfessor: Olive oil is widely used for a high level.
MorphoSeg: Olive oils are widely used, and there is widespread use.
SPULM: Olive oil is highly guided by the product of sophistication.
Word-Piece: The olive oil is so extensive that it pushes on the abundant possible.
Source:
Transliteration:
Reference: Many doctors who visited the booth agreed that there is a need for blood conservation in surgical practice.
BPE: Many visitors have agreed that practicing surgery is vital to blood transfusion.
Morfessor: his part, dozens of doctors enjoyed the importance of keeping a brief period of blood polluted.
MorphoSeg: a number of doctors who visited became gifted at a high risk of flowing blood vessels.
SPULM: Many visitors have agreed that having a lot of surgery during the surgery is vital.
Word-Piece: Many physicians have found that it is too important to prevent blood loss during medical treatment.
Source:
Transliteration:
Reference: And like all of us, the blind take careful note of tone, which can convey a variety of emotions.
BPE: Like any human, they discern the sound and sense of enlightenment that can help us to pass on various types of blindness.
Morfessor: Like anyone, the blind notice the tone of people who can understand how to react to different ways.
MorphoSeg: Like everyone, they notice the sound of the tone of voice of the people.
SPULM: Like anyone, blind people discern the concept of an eye to convey variety of emotions.
Word-Piece: Like everyone, people’s voice and tongues introduce various kinds of emotions.
Source:
Transliteration:
Reference: It’s to be recalled that the five kidnappers after having obliged the airplane that was having a local flight on April 26, 2001 landed it in Khartoum, released the people on board and gave themselves up to the Sudan government.
BPE: It is to be recalled that five enemies gave their hands in Sudan by disturbing the air force plane that was in the country on April 18, 2001 and after scheduling the Khartoum passed on by.
Morfessor: It is to be recalled that the five kidnappers used to hand the passengers down to Sudan government by giving the heads of the passengers who were in their arms on April 18, 2001, after threatening airport.
MorphoSeg: It is to be recalled that the five kidnappers left the air force in their country on April 18, 2001 to put the passengers behind bars and handed them over to Sudan.
SPULM: It is to be recalled that the five enemies released the air force airplane that was on its way to April 18, 2001 and released their hand to Sudanese government.
Word-Piece: The five enemys are to give their hand over to Sudan with out passengers’ hand after diverting the air force on May 18, 2001, the dispute was to be recalled.
Source:
Transliteration:
Reference: In response to so called rent riots in Chicago, Illinois, U.S.A., that occurred during the great depression of the 1930’s, city officials suspended evictions and arranged for some of the rioters to get work.
BPE: During the 1930’s, world economic downfall was raised in Chicko, U.S.A., with regards to rent accounts for local oppositions and some oppositions to function in the activities of the city.
Morfessor: In the 1930’s, the world grew up during a great depression in the nation of economic depression as a result of the Jerusalem crisis, contact with housebounds, enforced security officials from house to house, and oppositions stopped.
MorphoSeg: In the 1930’s, during the great depression in the world, an violence in the United States was formed in rebellion against houses, so the city authorities could stand up to ground troops to get jobs from their rent.
SPULM: During the 1930’s, an average of great economic breakthroughs in the world between China, U.S.A., U.S.A., and some staffs of the town’s authorities had influenced themselves to stop checking chores.
Word-Piece: During the 1930’s a great depression on the world’s economic depression, Missouri, U.S.A., U.S. home rebellion was issued, and some opposers promoted headquarters and opposed virtually.

References

[1]
Ben Allison, David Guthrie, and Louise Guthrie. 2006. Another look at the data sparsity problem. In Proceedings of the 9th International Conference on Text, Speech and Dialogue, TSD 2006, Brno, Czech Republic, September 11-15, 2006, Proceedings (Lecture Notes in Computer Science, Vol. 4188). Springer, 327–334.
[2]
Ali Araabi and Christof Monz. 2020. Optimizing Transformer for low-resource neural machine translation. In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020. International Committee on Computational Linguistics, 3429–3435.
[3]
Duygu Ataman, Matteo Negri, Marco Turchi, and Marcello Federico. 2017. Linguistically motivated vocabulary reduction for neural machine translation from Turkish to English. Prague Bull. Math. Linguistics 108 (2017), 331–342. http://ufal.mff.cuni.cz/pbml/108/art-ataman-negri-turchi-federico.pdf
[4]
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1409.0473
[5]
Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieras, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina J. Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Abbott Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, Juan López Bautista, Gema Celeste Silva Villegas, Lucas Torroba Hennigen, Adam Ek, David Guriel, Peter Dirix, Jean-Philippe Bernardy, Andrey Scherbakov, Aziyana Bayyr-ool, Antonios Anastasopoulos, Roberto Zariquiey, Karina Sheifer, Sofya Ganieva, Hilaria Cruz, Ritván Karahóga, Stella Markantonatou, George Pavlidis, Matvey Plugaryov, Elena Klyachko, Ali Salehi, Candy Angulo, Jatayu Baxi, Andrew Krizhanovsky, Natalia Krizhanovskaya, Elizabeth Salesky, Clara Vania, Sardana Ivanova, Jennifer White, Rowan Hall Maudslay, Josef Valvoda, Ran Zmigrod, Paula Czarnowska, Irene Nikkarinen, Aelita Salchak, Brijesh Bhatt, Christopher Straughn, Zoey Liu, Jonathan North Washington, Yuval Pinter, Duygu Ataman, Marcin Wolinski, Totok Suhardijanto, Anna Yablonskaya, Niklas Stoehr, Hossep Dolatian, Zahroh Nuriah, Shyam Ratan, Francis M. Tyers, Edoardo M. Ponti, Grant Aiton, Aryaman Arora, Richard J. Hatcher, Ritesh Kumar, Jeremiah Young, Daria Rodionova, Anastasia Yemelina, Taras Andrushko, Igor Marchenko, Polina Mashkovtseva, Alexandra Serova, Emily Prud’hommeaux, Maria Nepomniashchaya, Fausto Giunchiglia, Eleanor Chodroff, Mans Hulden, Miikka Silfverberg, Arya D. McCarthy, David Yarowsky, Ryan Cotterell, Reut Tsarfaty, and Ekaterina Vylomova. 2022. UniMorph 4.0: Universal morphology. In Proceedings of the 13th Language Resources and Evaluation Conference, LREC 2022, Marseille, France, 20-25 June 2022. European Language Resources Association, 840–855. https://aclanthology.org/2022.lrec-1.89
[6]
Kenneth R. Beesley and Lauri Karttunen. 2003. Finite-state morphology: Xerox tools and techniques. CSLI, Stanford (2003).
[7]
Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. 2016. Neural versus phrase-based machine translation quality: A case study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1–4, 2016. The Association for Computational Linguistics, 257–267.
[8]
Christian Bentz and Dimitrios Alikaniotis. 2016. The word entropy of natural languages. CoRR abs/1606.06996 (2016). arXiv:1606.06996. http://arxiv.org/abs/1606.06996
[9]
Christian Bentz, Dimitrios Alikaniotis, Michael Cysouw, and Ramon Ferrer-i-Cancho. 2017. The entropy of words - learnability and expressivity across more than 1000 languages. Entropy 19, 6 (2017), 275.
[10]
Steven Bird. 2006. NLTK: The natural language toolkit. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006. The Association for Computer Linguistics.
[11]
Kaj Bostrom and Greg Durrett. 2020. Byte pair encoding is suboptimal for language model pretraining. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020 (Findings of ACL, Vol. EMNLP 2020). Association for Computational Linguistics, 4617–4624.
[12]
Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the role of Bleu in machine translation research. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2006, April 3-7, 2006, Trento, Italy. The Association for Computer Linguistics. https://aclanthology.org/E06-1032/
[13]
Sheila Castilho, Joss Moorkens, Federico Gaspari, Iacer Calixto, John Tinsley, and Andy Way. 2017. Is neural machine translation the new state of the art? Prague Bull. Math. Linguistics 108 (2017), 109–120. http://ufal.mff.cuni.cz/pbml/108/art-castilho-moorkens-gaspari-tinsley-calixto-way.pdf
[14]
Colin Cherry, George F. Foster, Ankur Bapna, Orhan Firat, and Wolfgang Macherey. 2018. Revisiting character-based neural machine translation with capacity and compression. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018. Association for Computational Linguistics, 4295–4305. https://aclanthology.org/D18-1461/
[15]
Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014. Association for Computational Linguistics, 103–111.
[16]
Marta R. Costa-jussà and José A. R. Fonollosa. 2016. Character-based neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 2: Short Papers. The Association for Computer Linguistics.
[17]
Ryan Cotterell, Arun Kumar, and Hinrich Schütze. 2016. Morphological segmentation inside-out. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016. The Association for Computational Linguistics, 2325–2330.
[18]
Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Trans. Speech Lang. Process. 4, 1 (2007), 3:1–3:34.
[19]
Raj Dabre, Anoop Kunchukuttan, Atsushi Fujita, and Eiichiro Sumita. 2018. NICT’s participation in WAT 2018: Approaches using multilingualism and recurrently stacked layers. In Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation, WAT@PACLIC 2018, Hong Kong, December 1-3, 2018. Association for Computational Linguistics. https://aclanthology.org/Y18-3003/
[20]
Michael J. Denkowski and Graham Neubig. 2017. Stronger baselines for trustable results in neural machine translation. In Proceedings of the 1st Workshop on Neural Machine Translation, NMT@ACL 2017, Vancouver, Canada, August 4, 2017. Association for Computational Linguistics, 18–27.
[21]
Prajit Dhar, Arianna Bisazza, and Gertjan van Noord. 2020. Linguistically motivated subwords for English-Tamil translation: University of Groningen’s submission to WMT-2020. In Proceedings of the 5th Conference on Machine Translation, WMT@EMNLP 2020, Online, November 19-20, 2020. Association for Computational Linguistics, 126–133. https://aclanthology.org/2020.wmt-1.9/
[22]
Shuoyang Ding, Adithya Renduchintala, and Kevin Duh. 2019. A call for prudent choice of subword merge operations in neural machine translation. In Proceedings of Machine Translation Summit XVII Volume 1: Research Track, MTSummit 2019, Dublin, Ireland, August 19-23, 2019. European Association for Machine Translation, 204–213. https://aclanthology.org/W19-6620/
[23]
Bonnie J. Dorr, Pamela W. Jordan, and John W. Benoit. 1999. A survey of current paradigms in machine translation. Adv. Comput. 49 (1999), 1–68.
[24]
Rotem Dror, Lotem Peled-Cohen, Segev Shlomov, and Roi Reichart. 2020. Statistical Significance Testing for Natural Language Processing. Morgan & Claypool Publishers.
[25]
Bradley Efron and Robert Tibshirani. 1993. An Introduction to the Bootstrap. Springer.
[26]
Ramy Eskander, Francesca Callejas, Elizabeth Nichols, Judith Klavans, and Smaranda Muresan. 2020. MorphAGram, evaluation and framework for unsupervised morphological segmentation. In Proceedings of the 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020. European Language Resources Association, 7112–7122. https://aclanthology.org/2020.lrec-1.879/
[27]
Ray Fabri, Michael Gasser, Nizar Habash, George Kiraz, and Shuly Wintner. 2014. Linguistic introduction: The orthography, morphology and syntax of Semitic languages. In Natural Language Processing of Semitic Languages, Imed Zitouni (Ed.). Springer, 3–41.
[28]
Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, George F. Foster, Alon Lavie, and Ondrej Bojar. 2021. Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain. In Proceedings of the 6th Conference on Machine Translation, WMT@EMNLP 2021, Online Event, November 10-11, 2021. Association for Computational Linguistics, 733–774. https://aclanthology.org/2021.wmt-1.73
[29]
Philip Gage. 1994. A new algorithm for data compression. C Users Journal 12, 2 (1994), 23–38.
[30]
Michael Gasser. 2011. HornMorpho: A system for morphological processing of Amharic, Oromo, and Tigrinya. In Conference on Human Language Technology for Development, Alexandria, Egypt.
[31]
Andargachew Mekonnen Gezmu, Andreas Nürnberger, and Tesfaye Bayu Bati. 2021. Neural machine translation for Amharic-English translation. In Proceedings of the 13th International Conference on Agents and Artificial Intelligence, ICAART 2021, Volume 1, Online Streaming, February 4-6, 2021. SCITEPRESS, 526–532.
[32]
Andargachew Mekonnen Gezmu, Andreas Nürnberger, and Tesfaye Bayu Bati. 2022. Extended parallel corpus for Amharic-English machine translation. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022). European Language Resources Association, Marseille, France, 6644–6653. http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.716.pdf
[33]
Andargachew Mekonnen Gezmu, Binyam Ephrem Seyoum, Michael Gasser, and Andreas Nürnberger. 2018. Contemporary Amharic corpus: Automatically morpho-syntactically tagged Amharic corpus. In Proceedings of the 1st Workshop on Linguistic Resources for Natural Language Processing. Association for Computational Linguistics, Santa Fe, New Mexico, USA, 65–70. https://aclanthology.org/W18-3809
[34]
Thamme Gowda and Jonathan May. 2020. Finding the optimal vocabulary size for neural machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020 (Findings of ACL, Vol. EMNLP 2020). Association for Computational Linguistics, 3955–3964.
[35]
Vikrant Goyal, Sourav Kumar, and Dipti Misra Sharma. 2020. Efficient neural machine translation for low-resource languages via exploiting related languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, ACL 2020, Online, July 5-10, 2020. Association for Computational Linguistics, 162–168.
[36]
Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O. K. Li. 2018. Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers). Association for Computational Linguistics, 344–354.
[37]
Ximena Gutierrez-Vasques, Christian Bentz, Olga Sozinova, and Tanja Samardzic. 2021. From characters to words: The turning point of BPE merges. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19-23, 2021. Association for Computational Linguistics, 3454–3468.
[38]
Barry Haddow, Rachel Bawden, Antonio Valerio Miceli Barone, Jindrich Helcl, and Alexandra Birch. 2022. Survey of low-resource machine translation. Comput. Linguistics 48, 3 (2022), 673–732.
[39]
Martin Haspelmath. 2007. Pre-established categories don’t exist: Consequences for language description and typology. Linguistic Typology 11, 1 (2007), 119–132.
[40]
Matthias Huck, Simon Riess, and Alexander M. Fraser. 2017. Target-side word segmentation strategies for neural machine translation. In Proceedings of the 2nd Conference on Machine Translation, WMT 2017, Copenhagen, Denmark, September 7-8, 2017. Association for Computational Linguistics, 56–67.
[41]
Sébastien Jean, KyungHyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers. The Association for Computer Linguistics, 1–10.
[42]
Daniel Jurafsky and James H. Martin. 2009. Speech and Language Processing (2nd Edition). Prentice-Hall, Inc., USA.
[43]
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1412.6980
[44]
Tom Kocmi, Christian Federmann, Roman Grundkiewicz, Marcin Junczys-Dowmunt, Hitokazu Matsushita, and Arul Menezes. 2021. To ship or not to ship: An extensive evaluation of automatic metrics for machine translation. In Proceedings of the 6th Conference on Machine Translation, WMT@EMNLP 2021, Online Event, November 10-11, 2021. Association for Computational Linguistics, 478–494. https://aclanthology.org/2021.wmt-1.57
[45]
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, EMNLP 2004, A meeting of SIGDAT, a Special Interest Group of the ACL, held in conjunction with ACL 2004, 25-26 July 2004, Barcelona, Spain. ACL, 388–395. https://aclanthology.org/W04-3250/
[46]
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, ACL 2007, June 23-30, 2007, Prague, Czech Republic. The Association for Computational Linguistics. https://aclanthology.org/P07-2045/
[47]
Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the 1st Workshop on Neural Machine Translation, NMT@ACL 2017, Vancouver, Canada, August 4, 2017. Association for Computational Linguistics, 28–39.
[48]
Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers. Association for Computational Linguistics, 66–75.
[49]
Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018. Association for Computational Linguistics, 66–71.
[50]
Séamus Lankford, Haithem Afli, and Andy Way. 2021. Transformers for low-resource languages: Is Féidir linn!. In Proceedings of the 18th Biennial Machine Translation Summit - Volume 1: Research Track, MTSummit 2021 Virtual, August 16-20, 2021. Association for Machine Translation in the Americas, 48–60. https://aclanthology.org/2021.mtsummit-research.5
[51]
Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2017. Fully character-level neural machine translation without explicit segmentation. Trans. Assoc. Comput. Linguistics 5 (2017), 365–378.
[52]
Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers. The Association for Computer Linguistics, 11–19.
[53]
Dominik Machácek, Jonás Vidra, and Ondrej Bojar. 2018. Morphological and language-agnostic word segmentation for NMT. In Text, Speech, and Dialogue - 21st International Conference, TSD 2018, Brno, Czech Republic, September 11-14, 2018, Proceedings (Lecture Notes in Computer Science, Vol. 11107). Springer, 277–284.
[54]
Benjamin Marie, Atsushi Fujita, and Raphael Rubino. 2021. Scientific credibility of machine translation research: A meta-evaluation of 769 papers. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021. Association for Computational Linguistics, 7297–7306.
[55]
Sabrina J. Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y. Lee, Benoît Sagot, and Samson Tan. 2021. Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP. CoRR abs/2112.10508 (2021). arXiv:2112.10508. https://arxiv.org/abs/2112.10508
[56]
John E. Ortega, Richard Castro Mamani, and Kyunghyun Cho. 2020. Neural machine translation with a polysynthetic low resource language. Mach. Transl. 34, 4 (2020), 325–346.
[57]
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA. ACL, 311–318.
[58]
Martin Popel and Ondrej Bojar. 2018. Training tips for the Transformer model. Prague Bull. Math. Linguistics 110 (2018), 43–70. http://ufal.mff.cuni.cz/pbml/110/art-popel-bojar.pdf
[59]
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the 3rd Conference on Machine Translation: Research Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018. Association for Computational Linguistics, 186–191.
[60]
Ricardo Rei, Craig Stewart, Ana C. Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020. Association for Computational Linguistics, 2685–2702.
[61]
Ehud Reiter. 2018. A structured review of the validity of BLEU. Comput. Linguistics 44, 3 (2018).
[62]
Jorma Rissanen. 1998. Stochastic Complexity in Statistical Inquiry. World Scientific Series in Computer Science, Vol. 15. World Scientific.
[63]
Elizabeth Salesky, Andrew Runge, Alex Coda, Jan Niehues, and Graham Neubig. 2020. Optimizing segmentation granularity for neural machine translation. Mach. Transl. 34, 1 (2020), 41–59.
[64]
Jonne Sälevä and Constantine Lignos. 2021. The effectiveness of morphology-aware segmentation in low-resource neural machine translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, EACL 2021, Online, April 19-23, 2021. Association for Computational Linguistics, 164–174.
[65]
Paul A. Samuelson. 1937. A note on measurement of utility. The Review of Economic Studies 4, 2 (1937), 155–161.
[66]
Víctor M. Sánchez-Cartagena, Juan Antonio Pérez-Ortiz, and Felipe Sánchez-Martínez. 2019. The Universitat d’Alacant submissions to the English-to-Kazakh news translation task at WMT 2019. In Proceedings of the 4th Conference on Machine Translation, WMT 2019, Florence, Italy, August 1-2, 2019 - Volume 2: Shared Task Papers, Day 1. Association for Computational Linguistics, 356–363.
[67]
Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2012, Kyoto, Japan, March 25-30, 2012. IEEE, 5149–5152.
[68]
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.
[69]
Rico Sennrich and Biao Zhang. 2019. Revisiting low-resource neural machine translation: A case study. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers. Association for Computational Linguistics, 211–221.
[70]
Claude E. Shannon. 1948. A mathematical theory of communication. Bell Syst. Tech. J. 27, 3 (1948), 379–423.
[71]
Peter Smit, Sami Virpioja, Stig-Arne Grönroos, and Mikko Kurimo. 2014. Morfessor 2.0: Toolkit for statistical morphological segmentation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2014, April 26-30, 2014, Gothenburg, Sweden. The Association for Computer Linguistics, 21–24.
[72]
Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1 (2014), 1929–1958. http://dl.acm.org/citation.cfm?id=2670313
[73]
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13, 2014, Montreal, Quebec, Canada. 3104–3112. https://proceedings.neurips.cc/paper/2014/hash/a14ac55a4f27472c5d894ec1c3c743d2-Abstract.html
[74]
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society, 2818–2826.
[75]
Antonio Toral, Lukas Edman, Galiya Yeshmagambetova, and Jennifer Spenader. 2019. Neural machine translation for English-Kazakh with morphological segmentation and synthetic data. In Proceedings of the 4th Conference on Machine Translation, WMT 2019, Florence, Italy, August 1-2, 2019 - Volume 2: Shared Task Papers, Day 1. Association for Computational Linguistics, 386–392.
[76]
Ashish Vaswani, Samy Bengio, Eugene Brevdo, François Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Lukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2Tensor for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas, AMTA 2018, Boston, MA, USA, March 17-21, 2018 - Volume 1: Research Papers. Association for Machine Translation in the Americas, 193–199. https://aclanthology.org/W18-1819/
[77]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. 5998–6008. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
[78]
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144 (2016). arXiv:1609.08144. http://arxiv.org/abs/1609.08144
[79]
Jingjing Xu, Hao Zhou, Chun Gan, Zaixiang Zheng, and Lei Li. 2021. Vocabulary learning via optimal transport for neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021. Association for Computational Linguistics, 7361–7373.
[80]
Janis Zuters, Gus Strazds, and Karlis Immers. 2018. Semi-automatic quasi-morphological word segmentation for neural machine translation. In Databases and Information Systems - 13th International Baltic Conference, DB&IS 2018, Trakai, Lithuania, July 1-4, 2018, Proceedings (Communications in Computer and Information Science, Vol. 838). Springer, 289–301.

    Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 9
September 2023
226 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3625383

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 September 2023
    Online AM: 28 July 2023
    Accepted: 14 July 2023
    Revised: 16 May 2023
    Received: 05 July 2022
    Published in TALLIP Volume 22, Issue 9

    Author Tags

    1. Neural machine translation
    2. morpheme-based word segmentation
    3. fusion languages
    4. low-resource languages
    5. transformers

    Qualifiers

    • Note
