
PyGaggle


PyGaggle provides a gaggle of deep neural architectures for text ranking and question answering. It was designed for tight integration with Pyserini, but can be easily adapted for other sources as well.

Currently, this repo contains implementations of the rerankers for CovidQA on CORD-19, as described in "Rapidly Bootstrapping a Question Answering Dataset for COVID-19".

Installation

  1. Install with pip: pip install pygaggle. If you prefer Anaconda, use conda env create -f environment.yml && conda activate pygaggle.

  2. Install PyTorch 1.4+.

  3. Download the index: sh scripts/update-index.sh.

  4. Make sure you have an installation of Java 11+: javac --version.

  5. Install Anserini.
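Taken together, a fresh setup might look like the following sketch (the exact PyTorch version pin and the Anserini build are assumptions; see each project's documentation):

pip install pygaggle                  # step 1: install PyGaggle
pip install "torch>=1.4"              # step 2: PyTorch 1.4+
sh scripts/update-index.sh            # step 3: download the index
javac --version                       # step 4: verify Java 11+ is available
# step 5: install Anserini separately, following its repository's instructions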

Running rerankers on CovidQA

By default, the script uses data/lucene-index-covid-paragraph as the index path. To use a different index, set the environment variable CORD19_INDEX_PATH to its path. For a full list of mostly self-explanatory environment variables, see this file.
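For example (the path here is illustrative):

export CORD19_INDEX_PATH=/path/to/lucene-index-covid-paragraph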

BM25 uses the CPU. If you don't have a GPU for the transformer models, pass --device cpu (PyTorch device string format) to the script.
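For example, to run the T5 reranker on the CPU:

python -um pygaggle.run.evaluate_kaggle_highlighter --method t5 --device cpu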

Unsupervised Methods

BM25:

python -um pygaggle.run.evaluate_kaggle_highlighter --method bm25

BERT:

python -um pygaggle.run.evaluate_kaggle_highlighter --method transformer --model-name bert-base-cased

SciBERT:

python -um pygaggle.run.evaluate_kaggle_highlighter --method transformer --model-name allenai/scibert_scivocab_cased

BioBERT:

python -um pygaggle.run.evaluate_kaggle_highlighter --method transformer --model-name biobert

Supervised Methods

T5 (fine-tuned on MS MARCO):

python -um pygaggle.run.evaluate_kaggle_highlighter --method t5

BioBERT (fine-tuned on SQuAD v1.1):

  1. Download the weights, vocab, and config from the BioBERT repository to the same folder.

  2. Rename the following files in the folder:

mv bert_config.json config.json
for filename in model.ckpt-*; do
    # Strip the training-step suffix, e.g. model.ckpt-1000.index -> model.ckpt.index
    mv "$filename" "$(python -c "import re; print(re.sub(r'ckpt-\d+', 'ckpt', '$filename'))")"
done
  3. Evaluate the model:
python -um pygaggle.run.evaluate_kaggle_highlighter --method qa_transformer --model-name <folder path>
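For example, if the files were downloaded into a folder named biobert-squad (an illustrative name, not prescribed by the BioBERT repository):

python -um pygaggle.run.evaluate_kaggle_highlighter --method qa_transformer --model-name biobert-squad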

BioBERT (fine-tuned on MS MARCO):

  1. Download the weights, vocab, and config from our Google Storage bucket. This requires an installation of gsutil.
mkdir biobert-marco && cd biobert-marco
gsutil cp "gs://neuralresearcher_data/doc2query/experiments/exp374/model.ckpt-100000*" .
gsutil cp gs://neuralresearcher_data/biobert_models/biobert_v1.1_pubmed/bert_config.json config.json
gsutil cp gs://neuralresearcher_data/biobert_models/biobert_v1.1_pubmed/vocab.txt .
  2. Rename the files:
for filename in model.ckpt-*; do
    # Strip the training-step suffix so the filenames match what the loader expects
    mv "$filename" "$(python -c "import re; print(re.sub(r'ckpt-\d+', 'ckpt', '$filename'))")"
done
  3. Evaluate the model:
python -um pygaggle.run.evaluate_kaggle_highlighter --method seq_class_transformer --model-name <folder path>
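With the biobert-marco folder created in step 1 (and assuming the command is run from its parent directory), this becomes:

python -um pygaggle.run.evaluate_kaggle_highlighter --method seq_class_transformer --model-name biobert-marco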
