This repository contains data and baselines from the paper:
The FLoRes Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English.
The data can be downloaded directly at:
https://github.com/facebookresearch/flores/raw/master/data/wikipedia_en_ne_si_test_sets.tgz
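For example, to fetch and unpack the archive manually (the archive name matches the URL above):

$ wget https://github.com/facebookresearch/flores/raw/master/data/wikipedia_en_ne_si_test_sets.tgz
$ tar xvzf wikipedia_en_ne_si_test_sets.tgz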
The following instructions can be used to reproduce the baseline results from the paper.
The baseline uses the Indic NLP Library and sentencepiece for preprocessing; fairseq for model training; and sacrebleu for scoring.
Dependencies can be installed via pip:
$ pip install fairseq sacrebleu sentencepiece
The Indic NLP Library will be cloned automatically by the prepare-{ne,si}en.sh scripts.
The download-data.sh script can be used to download and extract the raw data. Thereafter the prepare-neen.sh and prepare-sien.sh scripts can be used to preprocess the raw data. In particular, they will use the sentencepiece library to learn a shared BPE vocabulary with 5000 subword units and binarize the data for training with fairseq.
To download and extract the raw data:
$ bash download-data.sh
Thereafter, run the following to preprocess the raw data:
$ bash prepare-neen.sh
$ bash prepare-sien.sh
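For reference, the BPE step inside the prepare scripts corresponds roughly to the following sentencepiece invocation; the file names here are illustrative, not the exact paths the scripts use:

$ spm_train \
    --input=train.ne-en.ne,train.ne-en.en \
    --model_prefix=sentencepiece.bpe \
    --vocab_size=5000 \
    --model_type=bpe \
    --character_coverage=1.0

The resulting sentencepiece.bpe.model is applied to the train, valid and test splits, which are then binarized for fairseq with fairseq-preprocess.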
To train a baseline Ne-En model on a single GPU:
$ CUDA_VISIBLE_DEVICES=0 fairseq-train \
data-bin/wiki_ne_en_bpe5000/ \
--source-lang ne --target-lang en \
--arch transformer --share-all-embeddings \
--encoder-layers 5 --decoder-layers 5 \
--encoder-embed-dim 512 --decoder-embed-dim 512 \
--encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \
--encoder-attention-heads 2 --decoder-attention-heads 2 \
--encoder-normalize-before --decoder-normalize-before \
--dropout 0.4 --attention-dropout 0.2 --relu-dropout 0.2 \
--weight-decay 0.0001 \
--label-smoothing 0.2 --criterion label_smoothed_cross_entropy \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0 \
--lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-7 \
--lr 1e-3 --min-lr 1e-9 \
--max-tokens 4000 \
--update-freq 4 \
--max-epoch 100 --save-interval 10
To train on 4 GPUs, remove the --update-freq flag and run CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train (...). If you have a Volta or newer GPU you can further improve training speed by adding the --fp16 flag.
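For example, combining both changes with the Ne-En command above gives:

$ CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train \
    data-bin/wiki_ne_en_bpe5000/ \
    --source-lang ne --target-lang en \
    --arch transformer --share-all-embeddings \
    --encoder-layers 5 --decoder-layers 5 \
    --encoder-embed-dim 512 --decoder-embed-dim 512 \
    --encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \
    --encoder-attention-heads 2 --decoder-attention-heads 2 \
    --encoder-normalize-before --decoder-normalize-before \
    --dropout 0.4 --attention-dropout 0.2 --relu-dropout 0.2 \
    --weight-decay 0.0001 \
    --label-smoothing 0.2 --criterion label_smoothed_cross_entropy \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0 \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-7 \
    --lr 1e-3 --min-lr 1e-9 \
    --max-tokens 4000 \
    --max-epoch 100 --save-interval 10 \
    --fp16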
This same architecture can be used for En-Ne, Si-En and En-Si:
- For En-Ne, update the training command with:
fairseq-train data-bin/wiki_ne_en_bpe5000 --source-lang en --target-lang ne
- For Si-En, update the training command with:
fairseq-train data-bin/wiki_si_en_bpe5000 --source-lang si --target-lang en
- For En-Si, update the training command with:
fairseq-train data-bin/wiki_si_en_bpe5000 --source-lang en --target-lang si
Run beam search generation and scoring with sacrebleu:
$ fairseq-generate \
data-bin/wiki_ne_en_bpe5000/ \
--source-lang ne --target-lang en \
--path checkpoints/checkpoint_best.pt \
--beam 5 --lenpen 1.2 \
--gen-subset valid \
--remove-bpe=sentencepiece \
--sacrebleu
Replace --gen-subset valid with --gen-subset test above to score the test set.
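Alternatively, you can save the generation output and score it with the sacrebleu command-line tool yourself: grep the hypothesis (H) and target (T) lines out of the log, sort them by sentence id, and feed them to sacrebleu. The file names below are illustrative:

$ fairseq-generate \
    data-bin/wiki_ne_en_bpe5000/ \
    --source-lang ne --target-lang en \
    --path checkpoints/checkpoint_best.pt \
    --beam 5 --lenpen 1.2 \
    --gen-subset test \
    --remove-bpe=sentencepiece > gen.out
$ grep ^H gen.out | sort -t- -k2 -n | cut -f3 > gen.sys
$ grep ^T gen.out | sort -t- -k2 -n | cut -f2 > gen.ref
$ sacrebleu gen.ref < gen.sys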
Tokenized BLEU for En-Ne and En-Si:
For these language pairs we report tokenized BLEU. You can compute tokenized BLEU by removing the --sacrebleu flag from the fairseq-generate command:
$ fairseq-generate \
data-bin/wiki_ne_en_bpe5000/ \
--source-lang en --target-lang ne \
--path checkpoints/checkpoint_best.pt \
--beam 5 --lenpen 1.2 \
--gen-subset valid \
--remove-bpe=sentencepiece
After running the commands in the Download and preprocess data section above, run the following to download and preprocess the monolingual data:
$ bash prepare-monolingual.sh
To run iterative back-translation for two iterations on Ne-En, run the following:
$ bash reproduce.sh ne_en
The script will train an Ne-En supervised model, translate Nepali monolingual data, train an En-Ne back-translation iteration 1 model, translate English monolingual data back to Nepali, and train an Ne-En back-translation iteration 2 model. All of the model training and data generation happens locally. The script uses all the GPUs listed in the CUDA_VISIBLE_DEVICES variable unless specific CUDA device ids are passed to train.py, and it is designed to adjust the hyper-parameters according to the number of available GPUs. With 8 Tesla V100 GPUs, the full pipeline takes about 25 hours to finish. We expect the final BT iteration 2 Ne-En model to achieve around 15.9 (sacre)BLEU on the devtest set. The script supports the ne_en, en_ne, si_en and en_si directions.
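As a rough illustration of what one step of the pipeline does, the sketch below translates binarized Nepali monolingual data with the supervised Ne-En model and pairs each hypothesis with its source to form synthetic En-Ne training data (synthetic English source, real Nepali target). The paths data-bin/monolingual_ne/ and ne_en_supervised/ are illustrative assumptions, not the exact names reproduce.sh uses:

$ fairseq-generate \
    data-bin/monolingual_ne/ \
    --source-lang ne --target-lang en \
    --path ne_en_supervised/checkpoint_best.pt \
    --beam 5 \
    --remove-bpe=sentencepiece > mono.gen.out
$ grep ^S mono.gen.out | sort -t- -k2 -n | cut -f2 > synth.ne
$ grep ^H mono.gen.out | sort -t- -k2 -n | cut -f3 > synth.en

The synthetic pair (synth.en, synth.ne) would then be combined with the real parallel data, re-encoded with the shared BPE model and binarized, and used to train the next iteration's model; reproduce.sh handles all of this for you.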
If you use this data in your work, please cite:
@article{guzman2019two,
title={Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English},
author={Guzm\'{a}n, Francisco and Chen, Peng-Jen and Ott, Myle and Pino, Juan and Lample, Guillaume and Koehn, Philipp and Chaudhary, Vishrav and Ranzato, Marc'Aurelio},
journal={arXiv preprint arXiv:1902.01382},
year={2019}
}
- 2019-11-04: Add config to reproduce iterative back-translation result on Sinhala-English and English-Sinhala
- 2019-10-23: Add script to reproduce iterative back-translation result on Nepali-English and English-Nepali
- 2019-10-18: Add final test set
- 2019-05-20: Remove extra carriage return character from Nepali-English parallel dataset.
- 2019-04-18: Specify the linebreak character in the sentencepiece encoding script to fix a small portion of misaligned parallel sentences in the Nepali-English parallel dataset.
- 2019-03-08: Update tokenizer script to make it compatible with the previous version of indic_nlp.
- 2019-02-14: Update dataset preparation script to avoid an unexpected extra line being added to each parallel dataset.
The dataset is licensed under CC-BY-SA, see the LICENSE file for details.