Fine-tuning fails with error AssertionError: An error in model's partition and checkpoint's slice was detected #31
Comments
Post your config, but maybe @l-k-11235 can help; she tested this morning and it was fine.
Thanks for the reply. Here are my configs. I've already checked with Lina, but we were not able to identify the issue.
# General settings
seed: 1234
overwrite: true
report_every: 10
n_sample: 0
tensorboard: true
# transforms config
transforms: [insert_mask_before_placeholder, onmt_tokenize, filtertoolong]
transforms_configs:
# datasets
data:
skip_empty_level: silent # silently ignore empty lines in the data
training:
You need to rename w_1, w_2 and w_3.
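A minimal sketch of what such a rename could look like, assuming the names in question are dotted parameter keys in a checkpoint-like dictionary. The thread does not say what the new names are, so the targets in `RENAMES` are placeholders, and the example keys and the `rename_keys` helper are hypothetical, not part of the project.

```python
# Hypothetical mapping: the thread does not give the new names,
# so these targets are placeholders to be replaced with the real ones.
RENAMES = {
    "w_1": "NEW_NAME_FOR_W_1",
    "w_2": "NEW_NAME_FOR_W_2",
    "w_3": "NEW_NAME_FOR_W_3",
}


def rename_keys(state, renames=RENAMES):
    """Return a copy of `state` with matching dotted key components renamed.

    Only whole components are matched, so a component such as "w_10"
    is left untouched.
    """
    renamed = {}
    for key, value in state.items():
        parts = [renames.get(part, part) for part in key.split(".")]
        new_key = ".".join(parts)
        if new_key in renamed:
            raise KeyError(f"duplicate key after renaming: {new_key}")
        renamed[new_key] = value
    return renamed


if __name__ == "__main__":
    # Illustrative key names only; real checkpoints may be laid out differently.
    example = {
        "layers.0.mlp.w_1.weight": "tensor-a",
        "layers.0.mlp.w_2.weight": "tensor-b",
        "layers.0.attn.q.weight": "tensor-c",
    }
    for name in rename_keys(example):
        print(name)
```

The function builds a new dictionary rather than editing in place, so the original mapping stays available for comparison if the renamed checkpoint still fails to load.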
Also, git pull; the latest fix was just pushed.
Hello,
I am trying to fine-tune a llama3-8b model on 2 GPUs but I keep getting the following error: AssertionError: An error in model's partition and checkpoint's slice was detected
I got this error both on commit 4954c12 and on commit 7077ddf. I also tried to run this on two different pairs of GPUs but the result did not change.
Yesterday I launched the exact same fine-tuning and it ran fine (besides the tensor parallel model issue that was fixed in the meantime).
Do you have any hint as to why this could be happening?
Thanks