Fine-tuning fails with error AssertionError: An error in model's partition and checkpoint's slice was detected · Issue #31 · eole-nlp/eole
Closed
@randy-ac

Description

Hello,

I am trying to fine-tune a llama3-8b model on 2 GPUs, but I keep getting the following error:

Traceback (most recent call last):
  File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/utils/distributed.py", line 179, in spawned_train
    process_fn(config, device_id=device_id)
  File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/train_single.py", line 169, in main
    model, _, _ = get_model_class(config.model).from_config(
  File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 495, in from_config
    model.training_logic(running_config, vocabs, checkpoint, device_id)
  File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 288, in training_logic
    self.load_checkpoint(
  File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 248, in load_checkpoint
    self.load_safe_state_dict(
  File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 706, in load_safe_state_dict
    self._load_param(
  File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 572, in _load_param
    param.data.size()
AssertionError: An error in model's partition and checkpoint's slice was detected

Process SpawnProcess-2:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
KeyboardInterrupt

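From the traceback, the failing check is in _load_param, which apparently compares the size of the checkpoint slice assigned to each rank against the size of that rank's local parameter. As a rough sketch of that kind of invariant (hypothetical shapes and a hypothetical column-wise partitioning, not eole's actual implementation):

import torch

world_size = 2                               # two GPUs, as in this setup
full_weight = torch.empty(4096, 4096)        # a full projection matrix loaded from the checkpoint

for rank in range(world_size):
    # column-wise slice of the checkpoint tensor meant for this rank
    cols = full_weight.size(1) // world_size
    ckpt_slice = full_weight[:, rank * cols : (rank + 1) * cols]

    # the local parameter this rank allocated for its partition of the layer
    local_param = torch.empty(4096, 4096 // world_size)

    # the invariant being enforced: the slice must match the local parameter size
    assert ckpt_slice.size() == local_param.size(), (
        "An error in model's partition and checkpoint's slice was detected"
    )

So the error suggests the shapes the model allocates per GPU and the slices taken from the checkpoint no longer line up.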
I got this error on both commit 4954c12 and commit 7077ddf. I also tried running this on two different pairs of GPUs, but the result did not change.

Yesterday I launched the exact same fine-tuning and it ran fine (aside from the tensor parallel model issue that was fixed in the meantime).

Do you have any hint as to why this could be happening?

Thanks
