8000 GitHub - naist-nlp/mbrs: A library for minimum Bayes risk (MBR) decoding
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

naist-nlp/mbrs

Repository files navigation

mbrs is a library for minimum Bayes risk (MBR) decoding.

PyPi GitHub

Paper | Reference docs | Citation | Release notes

Installation

You can install from PyPi:

pip install mbrs

For developers, it can be installed from the source.

git clone https://github.com/naist-nlp/mbrs.git
cd mbrs/
pip install ./

For uv users:

git clone https://github.com/naist-nlp/mbrs.git
cd mbrs/
uv sync

Quick start

mbrs provides two interfaces: command-line interface (CLI) and Python API.

Command-line interface

Command-line interface can run MBR decoding from command-line. Before running MBR decoding, you can generate hypothesis sentences with mbrs-generate:

mbrs-generate \
  sources.txt \
  --output hypotheses.txt \
  --lang_pair en-de \
  --model facebook/m2m100_418M \
  --num_candidates 1024 \
  --sampling eps --epsilon 0.02 \
  --batch_size 8 --sampling_size 8 --fp16 \
  --report_format rounded_outline

Beam search can also be used by replacing --sampling eps --epsilon 0.02 with --beam_size 10.

Next, MBR decoding and other decoding methods can be executed with mbrs-decode. This example regards the hypothesis set as the pseudo-reference set.

mbrs-decode \
  hypotheses.txt \
  --num_candidates 1024 \
  --nbest 1 \
  --source sources.txt \
  --references hypotheses.txt \
  --output translations.txt \
  --report report.txt --report_format rounded_outline \
  --decoder mbr \
  --metric comet \
  --metric.model Unbabel/wmt22-comet-da \
  --metric.batch_size 64 --metric.fp16 true

You can pass the arguments using a configuration yaml file via --config_path option. See docs for the details.

Finally, you can evaluate the score with mbrs-score:

mbrs-score \
  hypotheses.txt \
  --sources sources.txt \
  --references hypotheses.txt \
  --format json \
  --metric bleurt \
  --metric.batch_size 64 --metric.fp16 true

Python API

This is the example of COMET-MBR via Python API.

from mbrs.metrics import MetricCOMET
from mbrs.decoders import DecoderMBR

SOURCE = "ありがとう"
HYPOTHESES = ["Thanks", "Thank you", "Thank you so much", "Thank you.", "thank you"]

# Setup COMET.
metric_cfg = MetricCOMET.Config(
  model="Unbabel/wmt22-comet-da",
  batch_size=64,
  fp16=True,
)
metric = MetricCOMET(metric_cfg)

# Setup MBR decoding.
decoder_cfg = DecoderMBR.Config()
decoder = DecoderMBR(decoder_cfg, metric)

# Decode by COMET-MBR.
# This example regards the hypotheses themselves as the pseudo-references.
# Args: (hypotheses, pseudo-references, source)
output = decoder.decode(HYPOTHESES, HYPOTHESES, source=SOURCE, nbest=1)

print(f"Selected index: {output.idx}")
print(f"Output sentence: {output.sentence}")
print(f"Expected score: {output.score}")

List of implemented methods

Metrics

Currently, the following metrics are supported:

Decoders

The following decoding methods are implemented:

  • N-best reranking: rerank
  • MBR decoding: mbr

Specifically, the following methods of MBR decoding are included:

Selectors

The final output list is selected according to these selectors:

Related projects

  • mbr
    • Highly integrated with huggingface transformers by customizing generate() method of model implementation.
    • If you are looking for an MBR decoding library that is fully integrated into transformers, this might be a good choice.
    • Our mbrs works standalone; thus, not only transformers but also fairseq or LLM outputs via API can be used.

Citation

If you use this software, please cite:

@inproceedings{deguchi-etal-2024-mbrs,
    title = "mbrs: A Library for Minimum {B}ayes Risk Decoding",
    author = "Deguchi, Hiroyuki  and
      Sakai, Yusuke  and
      Kamigaito, Hidetaka  and
      Watanabe, Taro",
    editor = "Hernandez Farias, Delia Irazu  and
      Hope, Tom  and
      Li, Manling",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-demo.37",
    pages = "351--362",
}

License

This library is mainly developed by Hiroyuki Deguchi and published under the MIT-license.

0