Hello,
I am trying to fine-tune a llama3-8b model on 2 GPUs, but I keep getting the following error:
Traceback (most recent call last):
File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/utils/distributed.py", line 179, in spawned_train
process_fn(config, device_id=device_id)
File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/train_single.py", line 169, in main
model, _, _ = get_model_class(config.model).from_config(
File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 495, in from_config
model.training_logic(running_config, vocabs, checkpoint, device_id)
File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 288, in training_logic
self.load_checkpoint(
File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 248, in load_checkpoint
self.load_safe_state_dict(
File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 706, in load_safe_state_dict
self._load_param(
File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 572, in _load_param
param.data.size()
AssertionError: An error in model's partition and checkpoint's slice was detected
Process SpawnProcess-2:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
KeyboardInterrupt
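If I understand the traceback correctly, _load_param seems to compare the size of the locally partitioned parameter against the slice taken from the checkpoint tensor, and the two no longer match. A minimal sketch of that kind of tensor-parallel check, just to illustrate what I think is failing (made-up function, not eole's actual code):

```python
import torch

def copy_checkpoint_slice(param: torch.Tensor, ckpt_tensor: torch.Tensor,
                          rank: int, world_size: int, dim: int = 0) -> None:
    """Copy this rank's slice of a full checkpoint tensor into a
    tensor-parallel parameter shard (illustration only)."""
    shard = ckpt_tensor.size(dim) // world_size
    ckpt_slice = ckpt_tensor.narrow(dim, rank * shard, shard)
    # A mismatch here would trigger an error like the one above, e.g. if the
    # checkpoint tensor was saved with a different shape or sharding than
    # the model expects.
    assert param.data.size() == ckpt_slice.size(), (
        "An error in model's partition and checkpoint's slice was detected"
    )
    param.data.copy_(ckpt_slice)
```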
I got this error both on commit 4954c12 and on commit 7077ddf. I also tried running it on two different pairs of GPUs, but the result did not change.
Yesterday I launched the exact same fine-tuning and it ran fine (apart from the tensor parallel model issue that was fixed in the meantime).
Do you have any hint as to why this could be happening?
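In case it helps narrow this down, I can dump the tensor shapes from the converted checkpoint and compare them with what each GPU expects, roughly like this (the path below is just a placeholder for my converted llama3-8b checkpoint):

```python
from safetensors import safe_open

# Placeholder path to one shard of the converted checkpoint
ckpt_path = "llama3-8b-eole/model.00.safetensors"

with safe_open(ckpt_path, framework="pt", device="cpu") as f:
    for key in f.keys():
        # get_slice() exposes the stored shape without loading the weights
        print(key, f.get_slice(key).get_shape())
```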
Thanks