The finetuning in tensor parallel mode does not work as expected #18

Closed
l-k-11235 opened this issue Jun 11, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@l-k-11235 (Contributor) commented Jun 11, 2024

My finetuning logs show very different statistics depending on whether finetuning is run in single- or multi-GPU mode.
For instance:

  • With 2 GPUs:
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
[2024-06-10 15:23:11,451 INFO] Weighted corpora loaded so far:
			* train_dataset: 1
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
[2024-06-10 15:23:11,648 INFO] Weighted corpora loaded so far:
			* train_dataset: 1
/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[2024-06-10 15:24:15,087 INFO] Step 10/20000; acc: 21.6; ppl: 1004.81; xent: 6.91; aux: 0.000; lr: 2.00e-05; sents:     160; bsz: 1076/ 357/ 1; 2622/868 tok/s;     66 sec;
[2024-06-10 15:25:14,439 INFO] Step 20/20000; acc: 44.4; ppl: 77.83; xent: 4.35; aux: 0.000; lr: 2.00e-05; sents:     160; bsz: 1037/ 341/ 1; 2795/920 tok/s;    125 sec;
[2024-06-10 15:26:14,764 INFO] Step 30/20000; acc: 45.3; ppl: 47.23; xent: 3.86; aux: 0.000; lr: 2.00e-05; sents:     160; bsz: 1045/ 344/ 1; 2773/912 tok/s;    185 sec;
  • With 1 GPU:
[2024-06-10 10:09:14,672 INFO] Step 10/20000; acc: 50.1; ppl: 227.58; xent: 5.43; aux: 0.000; lr: 2.00e-05; sents:     160; bsz:  370/ 115/ 1; 1616/503 tok/s;     37 sec;
[2024-06-10 10:09:48,438 INFO] Step 20/20000; acc: 49.3; ppl: 208.37; xent: 5.34; aux: 0.000; lr: 2.00e-05; sents:     160; bsz:  361/ 113/ 1; 1710/535 tok/s;     70 sec;
[2024-06-10 10:10:23,572 INFO] Step 30/20000; acc: 50.2; ppl: 162.91; xent: 5.09; aux: 0.000; lr: 2.00e-05; sents:     160; bsz:  379/ 118/ 1; 1726/536 tok/s;    106 sec;
@francoishernandez added the bug label on Jun 12, 2024
@funboarder13920

After an investigation with Lina, it seems that the problem is related to the renaming of the layers in https://github.com/vince62s/eole/blob/bbd620c8be47c2ab51c1d0b64e35d737352d1087/eole/modules/transformer_mlp.py#L47 without updating the corresponding strings in:

eole/eole/models/model.py, lines 552 to 558 in 3a9b137:

    if name.split(".")[-1] in [
        "linear_keys",
        "linear_values",
        "linear_query",
        "w_1",
        "w_3",
    ]:

We will think of a fix to make the code robust to this kind of change (which also broke all the convert scripts), and of a few tests that could cover it; one possible direction is sketched below.
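
For illustration only, here is a minimal sketch of one way the shardable-layer names could be declared next to the modules that own them, so that a rename like the one in transformer_mlp.py only has to touch a single place. None of the class names, the COLUMN_PARALLEL attribute, or the gate_up_proj/up_proj/down_proj names below are taken from eole or from the eventual fix; they are assumptions used to illustrate the idea.

    # Hypothetical sketch (not eole code): each module declares which of its
    # sub-layers are column-parallel, and the tensor-parallel loading code
    # derives the name list from those declarations instead of hard-coding it.
    import torch.nn as nn


    class MLP(nn.Module):
        # Assumed post-rename attribute names for the column-parallel projections.
        COLUMN_PARALLEL = ("gate_up_proj", "up_proj")

        def __init__(self, d_model: int, d_ff: int):
            super().__init__()
            self.gate_up_proj = nn.Linear(d_model, d_ff, bias=False)
            self.up_proj = nn.Linear(d_model, d_ff, bias=False)
            self.down_proj = nn.Linear(d_ff, d_model, bias=False)


    class SelfAttention(nn.Module):
        COLUMN_PARALLEL = ("linear_keys", "linear_values", "linear_query")

        def __init__(self, d_model: int):
            super().__init__()
            self.linear_keys = nn.Linear(d_model, d_model, bias=False)
            self.linear_values = nn.Linear(d_model, d_model, bias=False)
            self.linear_query = nn.Linear(d_model, d_model, bias=False)
            self.final_linear = nn.Linear(d_model, d_model, bias=False)


    def column_parallel_names(model: nn.Module) -> set:
        """Collect the attribute names declared as column-parallel anywhere
        in the model, by reading each module's COLUMN_PARALLEL tuple."""
        names = set()
        for module in model.modules():
            names.update(getattr(module, "COLUMN_PARALLEL", ()))
        return names


    def check_column_parallel_names(model: nn.Module) -> None:
        """A test along these lines would catch silent renames: every
        declared name must still exist as an attribute of its module."""
        for module in model.modules():
            for attr in getattr(module, "COLUMN_PARALLEL", ()):
                assert hasattr(module, attr), (
                    f"{type(module).__name__} no longer has '{attr}'"
                )

The check in model.py could then keep its current shape (if name.split(".")[-1] in column_parallel_names(model): ...), and a rename in a module definition would propagate to the sharding logic automatically.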

Do you have a space to share a discussion about the strategy and the goal of this repo?

@vince62s (Contributor)

We will think of a fix to make the code robust to this kind of change (which also broke all the convert scripts), and of a few tests that could cover it.

This is not happening very often; I will fix the few places where I missed those.

Do you have a space to share a discussion about the strategy and the goal of this repo?

I suggest the Discussions tab of this repo.

@vince62s (Contributor)

See #30 and let me know if it completely fixes the issue.

@l-k-11235 (Contributor, Author) commented Jun 14, 2024

It seems to work well. With 2 GPUs I now have this in the logs:

[2024-06-14 06:52:41,726 INFO] Starting training on GPU: [0, 1]
[2024-06-14 06:52:41,726 INFO] Start training loop and validate every 200 steps...
[2024-06-14 06:52:41,727 INFO] Scoring with: {'insert_mask_before_placeholder': InsertMaskBeforePlaceholdersTransform(), 'onmt_tokenize': ONMTTokenizerTransform(share_vocab=True, src_subword_kwargs={'bpe_model_path': '/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole//llama3-8b/bpe.model', 'bpe_dropout': 0.0}, src_onmttok_kwargs={'mode': 'none'}, tgt_subword_kwargs={'bpe_model_path': '/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole//llama3-8b/bpe.model', 'bpe_dropout': 0.0}, tgt_onmttok_kwargs={'mode': 'none'}), 'filtertoolong': FilterTooLongTransform(src_seq_length=512, tgt_seq_length=512)}

This fp16_optimizer is designed to only work with apex.contrib.optimizers.*
To update, use updated optimizers with AMP.
[2024-06-14 06:52:43,949 INFO] Weighted corpora loaded so far:
			* cred_dataset: 1
[2024-06-14 06:52:44,387 INFO] Weighted corpora loaded so far:
			* cred_dataset: 1
[2024-06-14 06:53:27,376 INFO] Step 10/20000; acc: 49.2; ppl: 249.85; xent: 5.52; aux: 0.000; lr: 2.00e-05; sents:     320; bsz:  365/ 114/ 1; 2561/800 tok/s;     46 sec;
[2024-06-14 06:54:08,233 INFO] Step 20/20000; acc: 50.2; ppl: 189.43; xent: 5.24; aux: 0.000; lr: 2.00e-05; sents:     320; bsz:  375/ 117/ 1; 2937/916 tok/s;     87 sec;
[2024-06-14 06:54:48,808 INFO] Step 30/20000; acc: 51.1; ppl: 152.03; xent: 5.02; aux: 0.000; lr: 2.00e-05; sents:     320; bsz:  375/ 117/ 1; 2958/923 tok/s;    127 sec;
