This repository contains data and baselines from the paper:
The FLoRes Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English.
The data can be downloaded directly at:
https://github.com/facebookresearch/flores/raw/master/data/wikipedia_en_ne_si_test_sets.tgz
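For example, to fetch and unpack the archive manually (the archive name matches the URL above):

$ wget https://github.com/facebookresearch/flores/raw/master/data/wikipedia_en_ne_si_test_sets.tgz
$ tar xvzf wikipedia_en_ne_si_test_sets.tgz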
The following instructions can be used to reproduce the baseline results from the paper.
The baseline uses the Indic NLP Library and sentencepiece for preprocessing; fairseq for model training; and sacrebleu for scoring.
Dependencies can be installed via pip:
$ pip install fairseq sacrebleu sentencepiece
The Indic NLP Library will be cloned automatically by the prepare-{ne,si}en.sh scripts.
The download-data.sh script can be used to download and extract the raw data. Thereafter the prepare-neen.sh and prepare-sien.sh scripts can be used to preprocess the raw data. In particular, they will use the sentencepiece library to learn a shared BPE vocabulary with 5000 subword units and binarize the data for training with fairseq.
To download and extract the raw data:
$ bash download-data.sh
Thereafter, run the following to preprocess the raw data:
$ bash prepare-neen.sh
$ bash prepare-sien.sh
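For reference, the BPE step inside the prepare scripts corresponds roughly to the following sentencepiece invocation; the file names here are illustrative, not the exact paths the scripts use:

$ spm_train \
    --input=train.ne-en.ne,train.ne-en.en \
    --model_prefix=sentencepiece.bpe \
    --vocab_size=5000 \
    --model_type=bpe \
    --character_coverage=1.0

The resulting sentencepiece.bpe.model is applied to the train, valid and test splits, which are then binarized for fairseq with fairseq-preprocess.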
To train a baseline Ne-En model on a single GPU:
$ CUDA_VISIBLE_DEVICES=0 fairseq-train \
data-bin/wiki_ne_en_bpe5000/ \
--source-lang ne --target-lang en \
--arch transformer --share-all-embeddings \
--encoder-layers 5 --decoder-layers 5 \
--encoder-embed-dim 512 --decoder-embed-dim 512 \
--encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \
--encoder-attention-heads 2 --decoder-attention-heads 2 \
--encoder-normalize-before --decoder-normalize-before \
--dropout 0.4 --attention-dropout 0.2 --relu-dropout 0.2 \
--weight-decay 0.0001 \
--label-smoothing 0.2 --criterion label_smoothed_cross_entropy \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0 \
--lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-7 \
--lr 1e-3 --min-lr 1e-9 \
--max-tokens 4000 \
--update-freq 4 \
--max-epoch 100 --save-interval 10
To train on 4 GPUs, remove the --update-freq flag and run CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train (...). If you have a Volta or newer GPU you can further improve training speed by adding the --fp16 flag.
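For example, combining both changes with the Ne-En command above gives:

$ CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train \
    data-bin/wiki_ne_en_bpe5000/ \
    --source-lang ne --target-lang en \
    --arch transformer --share-all-embeddings \
    --encoder-layers 5 --decoder-layers 5 \
    --encoder-embed-dim 512 --decoder-embed-dim 512 \
    --encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \
    --encoder-attention-heads 2 --decoder-attention-heads 2 \
    --encoder-normalize-before --decoder-normalize-before \
    --dropout 0.4 --attention-dropout 0.2 --relu-dropout 0.2 \
    --weight-decay 0.0001 \
    --label-smoothing 0.2 --criterion label_smoothed_cross_entropy \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0 \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-7 \
    --lr 1e-3 --min-lr 1e-9 \
    --max-tokens 4000 \
    --max-epoch 100 --save-interval 10 \
    --fp16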
This same architecture can be used for En-Ne, Si-En and En-Si:
- For En-Ne, update the training command with:
fairseq-train data-bin/wiki_ne_en_bpe5000 --source-lang en --target-lang ne
- For Si-En, update the training command with:
fairseq-train data-bin/wiki_si_en_bpe5000 --source-lang si --target-lang en
- For En-Si, update the training command with:
fairseq-train data-bin/wiki_si_en_bpe5000 --source-lang en --target-lang si
Run beam search generation and scoring with sacrebleu:
$ fairseq-generate \
data-bin/wiki_ne_en_bpe5000/ \
--source-lang ne --target-lang en \
--path checkpoints/checkpoint_best.pt \
--beam 5 --lenpen 1.2 \
--gen-subset valid \
--remove-bpe=sentencepiece \
--sacrebleu
Replace --gen-subset valid with --gen-subset test above to score the test set.
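Alternatively, you can save the generation output and score it with the sacrebleu command-line tool yourself: grep the hypothesis (H) and target (T) lines out of the log, sort them by sentence id, and feed them to sacrebleu. The file names below are illustrative:

$ fairseq-generate \
    data-bin/wiki_ne_en_bpe5000/ \
    --source-lang ne --target-lang en \
    --path checkpoints/checkpoint_best.pt \
    --beam 5 --lenpen 1.2 \
    --gen-subset test \
    --remove-bpe=sentencepiece > gen.out
$ grep ^H gen.out | sort -t- -k2 -n | cut -f3 > gen.sys
$ grep ^T gen.out | sort -t- -k2 -n | cut -f2 > gen.ref
$ sacrebleu gen.ref < gen.sys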
Tokenized BLEU for En-Ne and En-Si:
For these language pairs we report tokenized BLEU. You can compute tokenized BLEU by removing the --sacrebleu flag from the fairseq-generate command:
$ fairseq-generate \
data-bin/wiki_ne_en_bpe5000/ \
--source-lang en --target-lang ne \
--path checkpoints/checkpoint_best.pt \
--beam 5 --lenpen 1.2 \
--gen-subset valid \
--remove-bpe=sentencepiece
After running the commands in the Download and preprocess data section above, run the following to download and preprocess the monolingual data:
$ bash prepare-monolingual.sh
To run iterative back-translation for two iterations on Ne-En, run the following:
$ bash reproduce.sh ne_en
The script will train an Ne-En supervised model, translate Nepali monolingual data, train an En-Ne back-translation iteration 1 model, translate English monolingual data back to Nepali, and train an Ne-En back-translation iteration 2 model. All of the model training and data generation happens locally. The script uses all the GPUs listed in the CUDA_VISIBLE_DEVICES variable unless specific CUDA device ids are passed to train.py, and it is designed to adjust the hyper-parameters according to the number of available GPUs. With 8 Tesla V100 GPUs, the full pipeline takes about 25 hours to finish. We expect the final BT iteration 2 Ne-En model to achieve around 15.9 (sacre)BLEU on the devtest set. The script supports the ne_en, en_ne, si_en and en_si directions.
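As a rough illustration of what one step of the pipeline does, the sketch below translates binarized Nepali monolingual data with the supervised Ne-En model and pairs each hypothesis with its source to form synthetic En-Ne training data (synthetic English source, real Nepali target). The paths data-bin/monolingual_ne/ and ne_en_supervised/ are illustrative assumptions, not the exact names reproduce.sh uses:

$ fairseq-generate \
    data-bin/monolingual_ne/ \
    --source-lang ne --target-lang en \
    --path ne_en_supervised/checkpoint_best.pt \
    --beam 5 \
    --remove-bpe=sentencepiece > mono.gen.out
$ grep ^S mono.gen.out | sort -t- -k2 -n | cut -f2 > synth.ne
$ grep ^H mono.gen.out | sort -t- -k2 -n | cut -f3 > synth.en

The synthetic pair (synth.en, synth.ne) would then be combined with the real parallel data, re-encoded with the shared BPE model and binarized, and used to train the next iteration's model; reproduce.sh handles all of this for you.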
If you use this data in your work, please cite:
@article{guzman2019two,
title={Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English},
author={Guzm\'{a}n, Francisco and Chen, Peng-Jen and Ott, Myle and Pino, Juan and Lample, Guillaume and Koehn, Philipp and Chaudhary, Vishrav and Ranzato, Marc'Aurelio},
journal={arXiv preprint arXiv:1902.01382},
year={2019}
}
- 2019-11-04: Add config to reproduce iterative back-translation result on Sinhala-English and English-Sinhala
- 2019-10-23: Add script to reproduce iterative back-translation result on Nepali-English and English-Nepali
- 2019-10-18: Add final test set
- 2019-05-20: Remove extra carriage return character from Nepali-English parallel dataset.
- 2019-04-18: Specify the linebreak character in the sentencepiece encoding script to fix a small portion of misaligned parallel sentences in the Nepali-English parallel dataset.
- 2019-03-08: Update tokenizer script to make it compatible with the previous version of indic_nlp.
- 2019-02-14: Update dataset preparation script to avoid an unexpected extra line being added to each parallel dataset.
The dataset is licensed under CC-BY-SA, see the LICENSE file for details.