The finetuning in tensor parallel mode does not work as expected #18

Closed
l-k-11235 opened this issue Jun 11, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@l-k-11235 (Contributor) commented Jun 11, 2024

My finetuning logs show very different statistics depending on whether finetuning is run in single- or multi-GPU mode.
For instance:

  • With 2 GPUs:
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
[2024-06-10 15:23:11,451 INFO] Weighted corpora loaded so far:
			* train_dataset: 1
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
[2024-06-10 15:23:11,648 INFO] Weighted corpora loaded so far:
			* train_dataset: 1
/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[2024-06-10 15:24:15,087 INFO] Step 10/20000; acc: 21.6; ppl: 1004.81; xent: 6.91; aux: 0.000; lr: 2.00e-05; sents:     160; bsz: 1076/ 357/ 1; 2622/868 tok/s;     66 sec;
[2024-06-10 15:25:14,439 INFO] Step 20/20000; acc: 44.4; ppl: 77.83; xent: 4.35; aux: 0.000; lr: 2.00e-05; sents:     160; bsz: 1037/ 341/ 1; 2795/920 tok/s;    125 sec;
[2024-06-10 15:26:14,764 INFO] Step 30/20000; acc: 45.3; ppl: 47.23; xent: 3.86; aux: 0.000; lr: 2.00e-05; sents:     160; bsz: 1045/ 344/ 1; 2773/912 tok/s;    185 sec;
  • With 1 GPU:
[2024-06-10 10:09:14,672 INFO] Step 10/20000; acc: 50.1; ppl: 227.58; xent: 5.43; aux: 0.000; lr: 2.00e-05; sents:     160; bsz:  370/ 115/ 1; 1616/503 tok/s;     37 sec;
[2024-06-10 10:09:48,438 INFO] Step 20/20000; acc: 49.3; ppl: 208.37; xent: 5.34; aux: 0.000; lr: 2.00e-05; sents:     160; bsz:  361/ 113/ 1; 1710/535 tok/s;     70 sec;
[2024-06-10 10:10:23,572 INFO] Step 30/20000; acc: 50.2; ppl: 162.91; xent: 5.09; aux: 0.000; lr: 2.00e-05; sents:     160; bsz:  379/ 118/ 1; 1726/536 tok/s;    106 sec;
@francoishernandez added the bug label on Jun 12, 2024
@funboarder13920

After an investigation with Lina, it seems that the problem is related to the renaming of the layers in https://github.com/vince62s/eole/blob/bbd620c8be47c2ab51c1d0b64e35d737352d1087/eole/modules/transformer_mlp.py#L47 without updating the corresponding strings in:

eole/eole/models/model.py, lines 552 to 558 in 3a9b137:

    if name.split(".")[-1] in [
        "linear_keys",
        "linear_values",
        "linear_query",
        "w_1",
        "w_3",
    ]:

We will think of a fix to make the code robust to this kind of change (which also broke all the convert scripts), and of a few tests that could cover it; one possible direction is sketched below.
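
For illustration only, here is a minimal sketch of one way the shardable-layer names could be declared next to the modules that own them, so that a rename like the one in transformer_mlp.py only has to touch a single place. None of the class names, the COLUMN_PARALLEL attribute, or the gate_up_proj/up_proj/down_proj names below are taken from eole or from the eventual fix; they are assumptions used to illustrate the idea.

    # Hypothetical sketch (not eole code): each module declares which of its
    # sub-layers are column-parallel, and the tensor-parallel loading code
    # derives the name list from those declarations instead of hard-coding it.
    import torch.nn as nn


    class MLP(nn.Module):
        # Assumed post-rename attribute names for the column-parallel projections.
        COLUMN_PARALLEL = ("gate_up_proj", "up_proj")

        def __init__(self, d_model: int, d_ff: int):
            super().__init__()
            self.gate_up_proj = nn.Linear(d_model, d_ff, bias=False)
            self.up_proj = nn.Linear(d_model, d_ff, bias=False)
            self.down_proj = nn.Linear(d_ff, d_model, bias=False)


    class SelfAttention(nn.Module):
        COLUMN_PARALLEL = ("linear_keys", "linear_values", "linear_query")

        def __init__(self, d_model: int):
            super().__init__()
            self.linear_keys = nn.Linear(d_model, d_model, bias=False)
            self.linear_values = nn.Linear(d_model, d_model, bias=False)
            self.linear_query = nn.Linear(d_model, d_model, bias=False)
            self.final_linear = nn.Linear(d_model, d_model, bias=False)


    def column_parallel_names(model: nn.Module) -> set:
        """Collect the attribute names declared as column-parallel anywhere
        in the model, by reading each module's COLUMN_PARALLEL tuple."""
        names = set()
        for module in model.modules():
            names.update(getattr(module, "COLUMN_PARALLEL", ()))
        return names


    def check_column_parallel_names(model: nn.Module) -> None:
        """A test along these lines would catch silent renames: every
        declared name must still exist as an attribute of its module."""
        for module in model.modules():
            for attr in getattr(module, "COLUMN_PARALLEL", ()):
                assert hasattr(module, attr), (
                    f"{type(module).__name__} no longer has '{attr}'"
                )

The check in model.py could then keep its current shape (if name.split(".")[-1] in column_parallel_names(model): ...), and a rename in a module definition would propagate to the sharding logic automatically.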

Do you have a space to share a discussion about the strategy and the goal of this repo?

@vince62s (Contributor)

We will think of a fix to make the code robust to this kind of change (which also broke all the convert scripts), and of a few tests that could cover it.

This is not happening very often; I will fix the few places where I missed those.

Do you have a space to share a discussion about the strategy and the goal of this repo?

I suggest the Discussions tab of this repo.

@vince62s (Contributor)

See #30 and let me know if it completely fixes the issue.

@l-k-11235 (Contributor, Author) commented Jun 14, 2024

It seems to work well. With 2 GPUs I now have this in the logs:

[2024-06-14 06:52:41,726 INFO] Starting training on GPU: [0, 1]
[2024-06-14 06:52:41,726 INFO] Start training loop and validate every 200 steps...
[2024-06-14 06:52:41,727 INFO] Scoring with: {'insert_mask_before_placeholder': InsertMaskBeforePlaceholdersTransform(), 'onmt_tokenize': ONMTTokenizerTransform(share_vocab=True, src_subword_kwargs={'bpe_model_path': '/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole//llama3-8b/bpe.model', 'bpe_dropout': 0.0}, src_onmttok_kwargs={'mode': 'none'}, tgt_subword_kwargs={'bpe_model_path': '/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole//llama3-8b/bpe.model', 'bpe_dropout': 0.0}, tgt_onmttok_kwargs={'mode': 'none'}), 'filtertoolong': FilterTooLongTransform(src_seq_length=512, tgt_seq_length=512)}

This fp16_optimizer is designed to only work with apex.contrib.optimizers.*
To update, use updated optimizers with AMP.
[2024-06-14 06:52:43,949 INFO] Weighted corpora loaded so far:
			* cred_dataset: 1
[2024-06-14 06:52:44,387 INFO] Weighted corpora loaded so far:
			* cred_dataset: 1
[2024-06-14 06:53:27,376 INFO] Step 10/20000; acc: 49.2; ppl: 249.85; xent: 5.52; aux: 0.000; lr: 2.00e-05; sents:     320; bsz:  365/ 114/ 1; 2561/800 tok/s;     46 sec;
[2024-06-14 06:54:08,233 INFO] Step 20/20000; acc: 50.2; ppl: 189.43; xent: 5.24; aux: 0.000; lr: 2.00e-05; sents:     320; bsz:  375/ 117/ 1; 2937/916 tok/s;     87 sec;
[2024-06-14 06:54:48,808 INFO] Step 30/20000; acc: 51.1; ppl: 152.03; xent: 5.02; aux: 0.000; lr: 2.00e-05; sents:     320; bsz:  375/ 117/ 1; 2958/923 tok/s;    127 sec;
