FLoRes Low Resource MT Benchmark

This repository contains data and baselines from the paper:
The FLoRes Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English.

The data can be downloaded directly at:
https://github.com/facebookresearch/flores/raw/master/data/wikipedia_en_ne_si_test_sets.tgz
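
If you only need the evaluation sets, the archive can be fetched and unpacked directly (a standard wget/tar invocation; the download-data.sh script described below additionally fetches the training data used by the baselines):

$ wget https://github.com/facebookresearch/flores/raw/master/data/wikipedia_en_ne_si_test_sets.tgz
$ tar xvzf wikipedia_en_ne_si_test_sets.tgz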

Baselines

The following instructions can be used to reproduce the baseline results from the paper.

Requirements

The baseline uses the Indic NLP Library and sentencepiece for preprocessing; fairseq for model training; and sacrebleu for scoring.

Dependencies can be installed via pip:

$ pip install fairseq sacrebleu sentencepiece

The Indic NLP Library will be cloned automatically by the prepare-{ne,si}en.sh scripts.
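
Optionally, you can sanity-check that the pip packages are importable before proceeding (a convenience check, not required by the scripts):

$ python -c "import fairseq, sentencepiece, sacrebleu; print('dependencies OK')"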

Download and preprocess data

The download-data.sh script downloads and extracts the raw data. Thereafter, the prepare-neen.sh and prepare-sien.sh scripts preprocess it: in particular, they use the sentencepiece library to learn a shared BPE vocabulary with 5000 subword units and binarize the data for training with fairseq.

To download and extract the raw data:

$ bash download-data.sh

Thereafter, run the following to preprocess the raw data:

$ bash prepare-neen.sh
$ bash prepare-sien.sh
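
Under the hood, the prepare scripts perform roughly the following sentencepiece and fairseq steps (a simplified sketch: file names are placeholders, the Indic NLP preprocessing is omitted, and the authoritative commands are in prepare-neen.sh and prepare-sien.sh):

$ # learn a joint BPE model with 5000 subword units on the combined training text
$ spm_train --input=train.ne-en.ne,train.ne-en.en \
    --model_prefix=sentencepiece.bpe --vocab_size=5000 \
    --model_type=bpe --character_coverage=1.0
$ # encode each split and side with the learned model (shown here for one file)
$ spm_encode --model=sentencepiece.bpe.model --output_format=piece \
    < train.ne-en.ne > train.bpe.ne-en.ne
$ # binarize for fairseq with a shared (joined) source/target dictionary
$ fairseq-preprocess --source-lang ne --target-lang en \
    --trainpref train.bpe.ne-en --validpref valid.bpe.ne-en --testpref test.bpe.ne-en \
    --destdir data-bin/wiki_ne_en_bpe5000 --joined-dictionary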

Train a baseline Transformer model

To train a baseline Ne-En model on a single GPU:

$ CUDA_VISIBLE_DEVICES=0 fairseq-train \
    data-bin/wiki_ne_en_bpe5000/ \
    --source-lang ne --target-lang en \
    --arch transformer --share-all-embeddings \
    --encoder-layers 5 --decoder-layers 5 \
    --encoder-embed-dim 512 --decoder-embed-dim 512 \
    --encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \
    --encoder-attention-heads 2 --decoder-attention-heads 2 \
    --encoder-normalize-before --decoder-normalize-before \
    --dropout 0.4 --attention-dropout 0.2 --relu-dropout 0.2 \
    --weight-decay 0.0001 \
    --label-smoothing 0.2 --criterion label_smoothed_cross_entropy \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0 \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-7 \
    --lr 1e-3 --min-lr 1e-9 \
    --max-tokens 4000 \
    --update-freq 4 \
    --max-epoch 100 --save-interval 10

To train on 4 GPUs, remove the --update-freq flag and run CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train (...). If you have a Volta or newer GPU, you can further improve training speed by adding the --fp16 flag.
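
For example, a 4-GPU run of the command above would look like this (a sketch; adjust the device list to your hardware):

$ CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train \
    data-bin/wiki_ne_en_bpe5000/ \
    --source-lang ne --target-lang en --fp16 \
    (... remaining flags as above, with --update-freq removed ...)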

This same architecture can be used for En-Ne, Si-En and En-Si:

  • For En-Ne, update the training command with:
    fairseq-train data-bin/wiki_ne_en_bpe5000 --source-lang en --target-lang ne
  • For Si-En, update the training command with:
    fairseq-train data-bin/wiki_si_en_bpe5000 --source-lang si --target-lang en
  • For En-Si, update the training command with:
    fairseq-train data-bin/wiki_si_en_bpe5000 --source-lang en --target-lang si
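
If you train several directions on the same machine, it may help to keep their checkpoints separate with fairseq's --save-dir option (the directory name below is just an example; remember to point --path at the matching checkpoint when generating):

$ fairseq-train data-bin/wiki_si_en_bpe5000 \
    --source-lang en --target-lang si \
    --save-dir checkpoints/en_si \
    (... remaining flags as in the Ne-En command above ...)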

Compute BLEU using sacrebleu

Run beam search generation and scoring with sacrebleu:

$ fairseq-generate \
    data-bin/wiki_ne_en_bpe5000/ \
    --source-lang ne --target-lang en \
    --path checkpoints/checkpoint_best.pt \
    --beam 5 --lenpen 1.2 \
    --gen-subset valid \
    --remove-bpe=sentencepiece \
    --sacrebleu

Replace --gen-subset valid with --gen-subset test above to score the test set.
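
Alternatively, you can save the generation log and score it with the standalone sacrebleu tool. The H-/T- line handling below follows fairseq-generate's usual output format, but treat it as a sketch and double-check against your fairseq version:

$ fairseq-generate \
    data-bin/wiki_ne_en_bpe5000/ \
    --source-lang ne --target-lang en \
    --path checkpoints/checkpoint_best.pt \
    --beam 5 --lenpen 1.2 \
    --gen-subset test \
    --remove-bpe=sentencepiece > gen.out
$ grep ^H gen.out | sed 's/^H-//' | sort -n | cut -f3- > gen.hyp
$ grep ^T gen.out | sed 's/^T-//' | sort -n | cut -f2- > gen.ref
$ sacrebleu gen.ref < gen.hyp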

Tokenized BLEU for En-Ne and En-Si:

For these language pairs, we report tokenized BLEU. You can compute it by removing the --sacrebleu flag from the fairseq-generate command:

$ fairseq-generate \
    data-bin/wiki_ne_en_bpe5000/ \
    --source-lang en --target-lang ne \
    --path checkpoints/checkpoint_best.pt \
    --beam 5 --lenpen 1.2 \
    --gen-subset valid \
    --remove-bpe=sentencepiece

Train iterative back-translation models

After running the commands in the Download and preprocess data section above, run the following to download and preprocess the monolingual data:

$ bash prepare-monolingual.sh

To train iterative back-translation models for two iterations on Ne-En, run the following:

$ bash reproduce.sh ne_en

The script will train an Ne-En supervised model, translate Nepali monolingual data into English, train an En-Ne back-translation (iteration 1) model, translate English monolingual data back into Nepali, and finally train an Ne-En back-translation (iteration 2) model. All model training and data generation happen locally. The script uses all GPUs listed in the CUDA_VISIBLE_DEVICES variable unless specific CUDA device IDs are passed to train.py, and it adjusts the hyper-parameters to the number of available GPUs. With 8 Tesla V100 GPUs, the full pipeline takes about 25 hours to finish. We expect the final iteration-2 Ne-En back-translation model to reach around 15.9 SacreBLEU on the devtest set. The script supports the ne_en, en_ne, si_en and en_si directions.
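
Schematically, each back-translation step translates monolingual text with the current model and mixes the resulting synthetic pairs into the training data. reproduce.sh automates all of this; the commands below are only an illustrative sketch (file and checkpoint names are placeholders, not the script's actual ones):

$ # translate sentencepiece-encoded Nepali monolingual data into English
$ # with the supervised Ne-En model
$ fairseq-interactive data-bin/wiki_ne_en_bpe5000/ \
    --source-lang ne --target-lang en \
    --path checkpoints/checkpoint_best.pt \
    --beam 5 --buffer-size 1024 --batch-size 32 \
    --input mono.bpe.ne > mono.gen.out
$ # extract the synthetic English hypotheses; paired with the original Nepali
$ # sentences and appended to the parallel data, they form the back-translated
$ # training set for the En-Ne (iteration 1) model
$ grep ^H mono.gen.out | sed 's/^H-//' | sort -n | cut -f3- > mono.synthetic.bpe.en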

Citation

If you use this data in your work, please cite:

@article{guzman2019flores,
  title={Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English},
  author={Guzm\'{a}n, Francisco and Chen, Peng-Jen and Ott, Myle and Pino, Juan and Lample, Guillaume and Koehn, Philipp and Chaudhary, Vishrav and Ranzato, Marc'Aurelio},
  journal={arXiv preprint arXiv:1902.01382},
  year={2019}
}

Changelog

  • 2019-11-04: Add config to reproduce the iterative back-translation results on Sinhala-English and English-Sinhala.
  • 2019-10-23: Add script to reproduce the iterative back-translation results on Nepali-English and English-Nepali.
  • 2019-10-18: Add final test set.
  • 2019-05-20: Remove extra carriage return character from the Nepali-English parallel dataset.
  • 2019-04-18: Specify the linebreak character in the sentencepiece encoding script to fix a small portion of misaligned sentences in the Nepali-English parallel dataset.
  • 2019-03-08: Update the tokenizer script to make it compatible with the previous version of indic_nlp.
  • 2019-02-14: Update the dataset preparation script to avoid an unexpected extra line being added to each parallel dataset.

License

The dataset is licensed under CC-BY-SA; see the LICENSE file for details.
