Hello,
I am trying to fine-tune a llama3-8b model on 2 GPUs, but I keep getting the following error:
Traceback (most recent call last):
File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/utils/distributed.py", line 179, in spawned_train
process_fn(config, device_id=device_id)
File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/train_single.py", line 169, in main
model, _, _ = get_model_class(config.model).from_config(
File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 495, in from_config
model.training_logic(running_config, vocabs, checkpoint, device_id)
File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 288, in training_logic
self.load_checkpoint(
File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 248, in load_checkpoint
self.load_safe_state_dict(
File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 706, in load_safe_state_dict
self._load_param(
File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 572, in _load_param
param.data.size()
AssertionError: An error in model's partition and checkpoint's slice was detected
Process SpawnProcess-2:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
KeyboardInterrupt
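If I understand the traceback correctly, _load_param seems to compare the size of the locally partitioned parameter against the slice taken from the checkpoint tensor, and the two no longer match. A minimal sketch of that kind of tensor-parallel check, just to illustrate what I think is failing (made-up function, not eole's actual code):

```python
import torch

def copy_checkpoint_slice(param: torch.Tensor, ckpt_tensor: torch.Tensor,
                          rank: int, world_size: int, dim: int = 0) -> None:
    """Copy this rank's slice of a full checkpoint tensor into a
    tensor-parallel parameter shard (illustration only)."""
    shard = ckpt_tensor.size(dim) // world_size
    ckpt_slice = ckpt_tensor.narrow(dim, rank * shard, shard)
    # A mismatch here would trigger an error like the one above, e.g. if the
    # checkpoint tensor was saved with a different shape or sharding than
    # the model expects.
    assert param.data.size() == ckpt_slice.size(), (
        "An error in model's partition and checkpoint's slice was detected"
    )
    param.data.copy_(ckpt_slice)
```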
I got this error both on commit 4954c12 and on commit 7077ddf. I also tried running it on two different pairs of GPUs, but the result did not change.
Yesterday I launched the exact same fine-tuning and it ran fine (apart from the tensor parallel model issue that was fixed in the meantime).
Do you have any hint as to why this could be happening?
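In case it helps narrow this down, I can dump the tensor shapes from the converted checkpoint and compare them with what each GPU expects, roughly like this (the path below is just a placeholder for my converted llama3-8b checkpoint):

```python
from safetensors import safe_open

# Placeholder path to one shard of the converted checkpoint
ckpt_path = "llama3-8b-eole/model.00.safetensors"

with safe_open(ckpt_path, framework="pt", device="cpu") as f:
    for key in f.keys():
        # get_slice() exposes the stored shape without loading the weights
        print(key, f.get_slice(key).get_shape())
```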
Thanks