
Fine-tuning fails with error AssertionError: An error in model's partition and checkpoint's slice was detected #31

Closed
randy-ac opened this issue Jun 14, 2024 · 4 comments


@randy-ac

Hello,

I am trying to fine-tune a llama3-8b model on 2 GPUs, but I keep getting the following error:

Traceback (most recent call last):
  File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/utils/distributed.py", line 179, in spawned_train
    process_fn(config, device_id=device_id)
  File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/train_single.py", line 169, in main
    model, _, _ = get_model_class(config.model).from_config(
  File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 495, in from_config
    model.training_logic(running_config, vocabs, checkpoint, device_id)
  File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 288, in training_logic
    self.load_checkpoint(
  File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 248, in load_checkpoint
    self.load_safe_state_dict(
  File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 706, in load_safe_state_dict
    self._load_param(
  File "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/eole/eole/models/model.py", line 572, in _load_param
    param.data.size()
AssertionError: An error in model's partition and checkpoint's slice was detected

Process SpawnProcess-2:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
KeyboardInterrupt
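
For context, the assertion is raised in _load_param while checkpoint weights are loaded into a tensor-parallel model: each GPU rank keeps only a slice of certain weight matrices, and the slice cut from the checkpoint tensor has to match the local parameter's shape exactly. A minimal sketch of that kind of check, with hypothetical shapes (illustrative only, not eole's actual code):

import torch

def load_sliced(param, ckpt_tensor, rank, world_size, dim=0):
    # Each rank loads only its 1/world_size slice of the checkpoint tensor.
    chunk = ckpt_tensor.size(dim) // world_size
    ckpt_slice = ckpt_tensor.narrow(dim, rank * chunk, chunk)
    # This is the kind of mismatch that raises the AssertionError above,
    # e.g. when a checkpoint tensor's shape does not match what the
    # partitioned model expects for that parameter.
    assert ckpt_slice.size() == param.data.size(), (
        "An error in model's partition and checkpoint's slice was detected"
    )
    param.data.copy_(ckpt_slice)

# Hypothetical shapes: a 4096x4096 weight split row-wise across 2 GPUs.
full_weight = torch.empty(4096, 4096)
local_param = torch.nn.Parameter(torch.empty(2048, 4096))
load_sliced(local_param, full_weight, rank=0, world_size=2)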

I got this error both on commit 4954c12 and on commit 7077ddf. I also tried to run this on two different pairs of GPUs, but the result did not change.

Yesterday I launched the exact same fine-tuning and it ran fine (apart from the tensor-parallel issue that was fixed in the meantime).

Do you have any hint as to why this could be happening?

Thanks

@vince62s
Contributor

Post your config, but maybe @l-k-11235 can help; she tested this morning and it was fine.

@randy-ac
Author
randy-ac commented Jun 14, 2024

Thanks for the reply. Here are my configs. I've already checked with Lina, but we were not able to identify the issue.

General settings

seed: 1234
share_vocab: true
save_data: "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-finetune"
src_vocab: "${EOLE_MODEL_DIR}/llama3-8b/vocab.txt" # size
src_vocab_size: 128256
tgt_vocab_size: 128256

overwrite: true

report_every: 10

n_sample: 0

tensorboard: true
tensorboard_log_dir: /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-finetune/logs/

transforms config

transforms: [insert_mask_before_placeholder, onmt_tokenize, filtertoolong]

transforms_configs:
  insert_mask_before_placeholder:
    response_patterns: ["⦅newline⦆⦅newline⦆### Response : ⦅newline⦆"]
  onmt_tokenize:
    src_subword_type: bpe
    src_subword_model: "${EOLE_MODEL_DIR}/llama3-8b/bpe.model"
    tgt_subword_type: bpe
    tgt_subword_model: "${EOLE_MODEL_DIR}/llama3-8b/bpe.model"
    gpt2_pretok: true
  filtertoolong:
    src_seq_length: 2048
    tgt_seq_length: 2048

datasets

data:
  new_synth_dataset:
    path_src: "/nas-labs/LM/randy_LLM_exp/new_synthetic_dataset/domain_subdomain_dataset/synthetic-dataset-with-roles_train.shuffle"
    weight: 1
  valid:
    path_src: "/nas-labs/LM/randy_LLM_exp/new_synthetic_dataset/domain_subdomain_dataset/synthetic-dataset-with-roles_dev.shuffle"

skip_empty_level: silent # silently ignore empty lines in the data

training:
  # GPU dispatching
  world_size: 2
  gpu_ranks: [0, 1]

  parallel_mode: "tensor_parallel"
  zero_out_prompt_loss: true

  train_steps: 20000
  valid_steps: 200

  dropout_steps: [0]
  dropout: [0.0]
  attention_dropout: [0.0]

  # Batching
  bucket_size: 10
  num_workers: 1
  batch_type: "sents"
  batch_size: 1
  valid_batch_size: 1
  batch_size_multiple: 1

  # Optimization
  model_dtype: "fp16"
  apex_opt_level: ""
  optim: "fusedadam"
  learning_rate: 2e-05
  warmup_steps: 100
  decay_method: "none"
  #learning_rate_decay: 0.98
  #start_decay_steps: 100
  #decay_steps: 10
  adam_beta2: 0.998
  accum_count: [16] #[8]
  accum_steps: [0]
  max_grad_norm: 0
  label_smoothing: 0.0
  param_init: 0
  param_init_glorot: true
  normalization: "tokens"

  # folders
  train_from: "${EOLE_MODEL_DIR}/llama3-8b"
  model_path: "/nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-finetune"
  keep_checkpoint: 30
  save_checkpoint_steps: 500

  # 4/8bit
  quant_layers: ['w_1', 'w_2', 'w_3', 'linear_values', 'linear_query', 'linear_keys', 'final_linear']
  quant_type: "bnb_NF4"

  # LoRa
  lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
  lora_rank: 16 #5 #2
  lora_dropout: 0.05
  lora_alpha: 32
  lora_embedding: false

@vince62s
Contributor

You need to rename w_1, w_2 and w_3.
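
One way to see what those layers are now called is to list the tensor names in the converted checkpoint; a minimal sketch, assuming the converted model is stored as a single model.00.safetensors file (the filename and path are assumptions, adjust to your setup):

import os
from safetensors import safe_open

# Path is an assumption based on the config above; adjust to the actual
# safetensors file(s) produced by the eole checkpoint conversion.
ckpt = os.path.expandvars("${EOLE_MODEL_DIR}/llama3-8b/model.00.safetensors")
with safe_open(ckpt, framework="pt", device="cpu") as f:
    for name in sorted(f.keys()):
        print(name, tuple(f.get_slice(name).get_shape()))

The printed names show which feed-forward layer identifiers should replace 'w_1', 'w_2' and 'w_3' in quant_layers.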

@vince62s
Contributor

Also, git pull: the last fix was just pushed.
