8000 MSMARCO support: monoBERT by ronakice · Pull Request #14 · castorini/pygaggle · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

MSMARCO support: monoBERT #14

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Apr 30, 2020
Merged

MSMARCO support: monoBERT #14

merged 14 commits into from
Apr 30, 2020

Conversation

ronakice
Copy link
Member
@ronakice ronakice commented Apr 29, 2020

Tested on a subset of 50 questions (1000 passages each) on colab (P100): https://colab.research.google.com/drive/1C-0U40wCUbDUObWReBYkeIBLgqJeKCuO

It takes about 16-17 minutes. So can potentially scale it up to 1000ish examples when having URAs replicate it. I will also entirely replicate the results entirely across multiple GPUs/TPUs as and when I add support for it!

m̶r̶r̶@̶1̶0̶ ̶0̶.̶3̶6̶4̶7̶6̶9̶8̶4̶1̶2̶6̶9̶8̶4̶1̶3̶ ̶v̶s̶ ̶@̶r̶o̶d̶r̶i̶g̶o̶n̶o̶g̶u̶e̶i̶r̶a̶4̶'̶s̶ ̶r̶u̶n̶.̶m̶o̶n̶o̶b̶e̶r̶t̶.̶d̶e̶v̶.̶s̶m̶a̶l̶l̶.̶t̶s̶v̶ ̶s̶c̶o̶r̶i̶n̶g̶ ̶m̶r̶r̶@̶1̶0̶:̶ ̶0̶.̶3̶5̶7̶6̶9̶0̶4̶7̶6̶1̶9̶0̶4̶7̶6̶3̶ ̶o̶n̶ ̶t̶h̶e̶s̶e̶ ̶5̶0̶ ̶q̶u̶e̶s̶t̶i̶o̶n̶s̶.̶ ̶(there was something wrong with the sample, reinvestigating now) This is potentially due to the tokenization difference (here we just use encode_plus that follows ’longest_first’ truncation strategy vs fixed length tokenization of 64 tokens for the query and 448 for the passage). It could also be due to a bunch of other things as we saw similar differences in numbert.

Some other important general changes I made:

  1. python>=3.6 instead of 3.7 (Colab right now uses 3.6).
  2. Refactored Settings (in settings.py).
  3. Added type Optional[str] field id to base class Query.
  4. Update to newer transformers/tokenizers

@rodrigonogueira4
Copy link
Member

@ronakice, thanks a lot for adding this!

Just curious, what is inference speed (query-doc per sec) you got on GPU vs TPU?

@ronakice
Copy link
Member Author

@rodrigonogueira4 right now pygaggle only supports single GPU I believe. Yes, we should add TPU and multi-GPU support soon and benchmark it! I'm prioritizing monoBERT/duoBERT/T5 for both MARCO and TREC first which the URAs could easily run (since they only have single GPU access).



class MsMarcoSettings(Settings):
msmarco_index_path: str = '/content/data/index-msmarco-passage-20191117-0ed488'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we convert all these as arguments to our main functions? It is hard to keep track of which exact command we used to run an experiment when they are in the code

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I just followed similar settings as that which was included for CovidQA. Maybe things like the index_path which rarely every change can remain here, and we can move model directory to main?

Copy link
10000
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM!

setup.py Outdated
'tqdm==4.45.0',
'transformers==2.7.0'
'transformers>=2.7.0'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we force 2.8.0 already?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My part of the code can, I don't know if it will cause issues in the rest (I don't think so..)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah.. lets try to move to 2.8.0 as it contains some important new functions for T5

Copy link
Member
@rodrigonogueira4 rodrigonogueira4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for implementing this! Just a couple of minor comments

Copy link
Member
@rodrigonogueira4 rodrigonogueira4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks all good to me and thanks again for implementing this!

Feel free to merge when you finish converting settings.py as arguments.

@ronakice ronakice merged commit 55e4961 into master Apr 30, 2020
@ronakice ronakice deleted the mono_marco branch April 30, 2020 23:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0