MSMARCO support: monoBERT #14
Conversation
@ronakice, thanks a lot for adding this! Just curious, what is the inference speed (query-doc pairs per second) you got on GPU vs TPU?
@rodrigonogueira4 right now pygaggle only supports a single GPU, I believe. Yes, we should add TPU and multi-GPU support soon and benchmark it! I'm prioritizing monoBERT/duoBERT/T5 for both MARCO and TREC first, which the URAs could easily run (since they only have single-GPU access).
pygaggle/settings.py (Outdated)

```python
class MsMarcoSettings(Settings):
    msmarco_index_path: str = '/content/data/index-msmarco-passage-20191117-0ed488'
```
Should we convert all of these into arguments to our main functions? It is hard to keep track of exactly which command we used to run an experiment when they are hard-coded.
Hmm, I just followed settings similar to those included for CovidQA. Maybe things like the index_path, which rarely ever change, can remain here, and we can move the model directory to main?
SGTM!
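The change discussed above could be sketched roughly as follows; the flag names and the `--model-dir` parameter are illustrative assumptions, not pygaggle's actual CLI:

```python
# Hypothetical sketch (not pygaggle's real interface): move hard-coded
# settings into CLI arguments so the exact command used for an
# experiment is self-documenting.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description='monoBERT MS MARCO reranking')
    # Rarely-changing paths keep a default but stay overridable.
    parser.add_argument(
        '--index-path',
        default='/content/data/index-msmarco-passage-20191117-0ed488')
    # Frequently-changing settings, like the model directory, are required.
    parser.add_argument('--model-dir', required=True)
    return parser

args = build_parser().parse_args(['--model-dir', '/path/to/monobert'])
print(args.index_path, args.model_dir)
```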
setup.py (Outdated)

```diff
     'tqdm==4.45.0',
-    'transformers==2.7.0'
+    'transformers>=2.7.0'
```
can we force 2.8.0 already?
My part of the code can; I don't know if it will cause issues in the rest (I don't think so).
Yeah, let's try to move to 2.8.0, as it contains some important new functions for T5.
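Given the agreement above, the pin would look like this in setup.py's `install_requires` (only the two packages visible in the diff are shown; the real setup.py lists more):

```python
# Sketch of the dependency pins discussed above; illustrative, not the
# full dependency list from pygaggle's setup.py.
install_requires = [
    'tqdm==4.45.0',
    'transformers==2.8.0',  # pinned exactly, per the discussion
]
print(install_requires)
```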
Thanks again for implementing this! Just a couple of minor comments.
Looks all good to me, and thanks again for implementing this!
Feel free to merge when you finish converting settings.py into arguments.
Tested on a subset of 50 questions (1000 passages each) on Colab (P100): https://colab.research.google.com/drive/1C-0U40wCUbDUObWReBYkeIBLgqJeKCuO
It takes about 16-17 minutes, so we can potentially scale it up to ~1000 examples when having the URAs replicate it. I will also replicate the results across multiple GPUs/TPUs as and when I add support for them!
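The scores reported for these replication runs are MRR@10, the standard MS MARCO passage metric. As a reference, a minimal sketch of the metric (helper name is mine, not pygaggle's):

```python
def mrr_at_10(ranked_relevance):
    """ranked_relevance: for each query, 0/1 relevance labels in ranked
    order. MRR@10 = mean over queries of 1/rank of the first relevant
    result within the top 10 (0 if none appears there)."""
    total = 0.0
    for labels in ranked_relevance:
        for rank, rel in enumerate(labels[:10], start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

# Two queries: first relevant hit at rank 2 and rank 1 -> (0.5 + 1.0) / 2
print(mrr_at_10([[0, 1, 0], [1, 0]]))  # 0.75
```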
~~mrr@10 0.3647698412698413 vs @rodrigonogueira4's run.monobert.dev.small.tsv scoring mrr@10: 0.3576904761904763 on these 50 questions.~~ (there was something wrong with the sample, reinvestigating now) This is potentially due to the tokenization difference (here we just use encode_plus, which follows the 'longest_first' truncation strategy, vs. fixed-length tokenization of 64 tokens for the query and 448 for the passage). It could also be due to a bunch of other things, as we saw similar differences in numbert.
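The two truncation schemes being compared can be sketched on plain token lists (a conceptual illustration, not transformers' actual implementation):

```python
# Illustrative sketch of the two truncation schemes discussed above,
# applied to already-tokenized query/passage token-id lists.

def truncate_longest_first(query, passage, max_len):
    """encode_plus-style 'longest_first': repeatedly trim one token from
    whichever sequence is currently longer until the pair fits."""
    query, passage = list(query), list(passage)
    while len(query) + len(passage) > max_len:
        if len(query) > len(passage):
            query.pop()
        else:
            passage.pop()
    return query, passage

def truncate_fixed(query, passage, query_budget=64, passage_budget=448):
    """Fixed-length scheme: hard caps of 64 query tokens and 448 passage
    tokens, the budgets mentioned above."""
    return query[:query_budget], passage[:passage_budget]

q = list(range(10))   # short query: 10 tokens
p = list(range(600))  # long passage: 600 tokens

lf_q, lf_p = truncate_longest_first(q, p, max_len=512)
fx_q, fx_p = truncate_fixed(q, p)
print(len(lf_q), len(lf_p))  # 10 502 -> longest_first gives spare room to the passage
print(len(fx_q), len(fx_p))  # 10 448 -> fixed budget drops the extra passage tokens
```

For a short query and long passage the two schemes keep different amounts of passage text, which could plausibly shift scores.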
Some other important general changes I made: