The finetuning in tensor parallel mode does not work as expected #18
Comments
After an investigation with Lina, it seems that the problem is related to the renaming at https://github.com/vince62s/eole/blob/bbd620c8be47c2ab51c1d0b64e35d737352d1087/eole/modules/transformer_mlp.py#L47 without updating the corresponding string (lines 552 to 558 in 3a9b137).
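For illustration, here is a minimal sketch of the kind of mismatch this describes; the module and substring names below are assumptions for the example, not eole's actual identifiers. Tensor-parallel sharding that selects parameters by hard-coded name substrings silently stops matching once the attribute it refers to is renamed:

```python
# Illustrative sketch only: module and substring names are assumptions,
# not eole's actual identifiers.
import torch.nn as nn


class TransformerMLP(nn.Module):
    def __init__(self, d_model=16, d_ff=32):
        super().__init__()
        # Attribute renamed at some point, e.g. gate_up_proj -> up_proj.
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)


# A hard-coded list elsewhere still uses the old attribute name, so the
# substring match below silently stops matching and the weight is never
# split across the tensor-parallel ranks.
COLUMN_PARALLEL_SUBSTRINGS = ("gate_up_proj",)  # stale after the rename


def shard_params(model, world_size, rank):
    for name, param in model.named_parameters():
        if any(s in name for s in COLUMN_PARALLEL_SUBSTRINGS):
            param.data = param.data.chunk(world_size, dim=0)[rank].contiguous()


mlp = TransformerMLP()
shard_params(mlp, world_size=2, rank=0)
# up_proj.weight keeps its full [32, 16] shape instead of the expected
# [16, 16] shard, so tensor-parallel finetuning breaks downstream.
print(mlp.up_proj.weight.shape)
```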
We will think of a fix making the code robust to this kind of change (which also broke all the convert scripts), and of a few tests that could cover it. Do you have a space to share a discussion about the strategy and the goals of this repo?
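On the testing side, one possible approach is a check that every hard-coded substring still matches at least one parameter of a freshly built model, so a rename fails CI instead of silently disabling the sharding. A pytest sketch, reusing the assumed names from the example above (not eole's real API):

```python
# Hypothetical pytest sketch; reuses TransformerMLP, shard_params and
# COLUMN_PARALLEL_SUBSTRINGS from the example above (assumed names).
import pytest

from example_sharding import (  # hypothetical module holding the sketch above
    COLUMN_PARALLEL_SUBSTRINGS,
    TransformerMLP,
    shard_params,
)


@pytest.mark.parametrize("substring", COLUMN_PARALLEL_SUBSTRINGS)
def test_parallel_substrings_still_match(substring):
    names = [n for n, _ in TransformerMLP().named_parameters()]
    assert any(substring in n for n in names), (
        f"'{substring}' matches no parameter; was a module attribute "
        "renamed without updating the sharding list?"
    )


def test_sharding_changes_at_least_one_weight():
    # Complementary check: sharding on one rank must actually shrink at
    # least one weight, otherwise the substring list is entirely stale.
    model = TransformerMLP()
    before = {n: p.shape for n, p in model.named_parameters()}
    shard_params(model, world_size=2, rank=0)
    after = {n: p.shape for n, p in model.named_parameters()}
    assert before != after, "no parameter was sharded; substring list is stale"
```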
This does not happen very often; I will fix the few places where I missed those renames.
I suggest the Discussions tab of this repo.
See #30, and let me know if it completely fixes the issue.
It seems to work well:
[2024-06-14 06:52:41,726 INFO] Starting training on GPU: [0, 1]
[2024-06-14 06:52:41,726 INFO] Start training loop and validate every 200 steps...
[2024-06-14 06:52:41,727 INFO] Scoring with: {'insert_mask_before_placeholder': InsertMaskBeforePlaceholdersTransform(), 'onmt_tokenize': ONMTTokenizerTransform(share_vocab=True, src_subword_kwargs={'bpe_model_path': '/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole//llama3-8b/bpe.model', 'bpe_dropout': 0.0}, src_onmttok_kwargs={'mode': 'none'}, tgt_subword_kwargs={'bpe_model_path': '/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole//llama3-8b/bpe.model', 'bpe_dropout': 0.0}, tgt_onmttok_kwargs={'mode': 'none'}), 'filtertoolong': FilterTooLongTransform(src_seq_length=512, tgt_seq_length=512)}
This fp16_optimizer is designed to only work with apex.contrib.optimizers.*
To update, use updated optimizers with AMP.
[2024-06-14 06:52:43,949 INFO] Weighted corpora loaded so far:
* cred_dataset: 1
[2024-06-14 06:52:44,387 INFO] Weighted corpora loaded so far:
* cred_dataset: 1
[2024-06-14 06:53:27,376 INFO] Step 10/20000; acc: 49.2; ppl: 249.85; xent: 5.52; aux: 0.000; lr: 2.00e-05; sents: 320; bsz: 365/ 114/ 1; 2561/800 tok/s; 46 sec;
[2024-06-14 06:54:08,233 INFO] Step 20/20000; acc: 50.2; ppl: 189.43; xent: 5.24; aux: 0.000; lr: 2.00e-05; sents: 320; bsz: 375/ 117/ 1; 2937/916 tok/s; 87 sec;
[2024-06-14 06:54:48,808 INFO] Step 30/20000; acc: 51.1; ppl: 152.03; xent: 5.02; aux: 0.000; lr: 2.00e-05; sents: 320; bsz: 375/ 117/ 1; 2958/923 tok/s; 127 sec;
My finetuning logs show very different statistics depending on whether finetuning is run in single-GPU or multi-GPU mode. For instance: