Does transformerEngine support 2080ti? · Issue #1680 · NVIDIA/TransformerEngine · GitHub

Does transformerEngine support 2080ti? #1680


Open
SeekPoint opened this issue Apr 14, 2025 · 4 comments
Labels
bug Something isn't working

Comments

@SeekPoint

While running another project that uses TransformerEngine, it triggered this exception:
https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/attention.py
# raise exception if no backend is available
if sum([use_flash_attention, use_fused_attention, use_unfused_attention]) == 0:
    raise ValueError(
        "No dot product attention backend is available for the provided inputs. Please"
        " run with NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2 to find out the reasons for"
        " disabling all backends."
    )
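
A minimal sketch of exercising the same backend-selection path outside the full training run (the shapes, dtype, and head configuration below are illustrative assumptions, not taken from the failing workload):

import torch
from transformer_engine.pytorch import DotProductAttention

# Hypothetical toy configuration; adjust to match the real model.
attn = DotProductAttention(num_attention_heads=16, kv_channels=64)

# Default qkv_format is "sbhd": [sequence, batch, heads, head_dim].
q = torch.randn(128, 2, 16, 64, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out = attn(q, k, v)  # raises ValueError if TE disables every attention backend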

SeekPoint added the bug label on Apr 14, 2025
@ptrendx
Member
ptrendx commented Apr 14, 2025

@cyanguwa Could you take a look? 2080ti is Turing (sm75).
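
For reference, a quick way to confirm the compute capability of the installed card from PyTorch (a 2080 Ti should report (7, 5), i.e. sm75):

import torch

# Prints the device name and its compute capability tuple.
print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))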

@ptrendx
Member
ptrendx commented Apr 14, 2025

@SeekPoint could you give more details about the workload that triggered this - e.g. datatype, shapes of tensors? Also, if you could rerun with those suggested environment variables (NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2) and paste the output that would be super helpful.
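
For reference, a minimal sketch of enabling that logging from the launcher script itself (assuming a Python entry point; equivalently the two variables can be exported in the shell before the launch command):

import os

# Set in the environment of every rank before transformer_engine is imported,
# so the backend-selection log lines show up ahead of any traceback.
os.environ["NVTE_DEBUG"] = "1"
os.environ["NVTE_DEBUG_LEVEL"] = "2"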

@SeekPoint
Author
SeekPoint commented Apr 15, 2025

I have 4 cards, a special version of the 2080 Ti; each card has 22 GB of GPU memory.

The case comes from https://swift.readthedocs.io/zh-cn/latest/Instruction/Megatron-SWIFT%E8%AE%AD%E7%BB%83.html

Since the log is huge, I am only pasting the traceback.
Also, I found another report of the same error in NeMo: NVIDIA/NeMo#11218

The traceback:

[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/local/lib/python3.11/site-packages/swift/cli/_megatron/sft.py", line 4, in
[rank0]: megatron_sft_main()
[rank0]: File "/usr/local/lib/python3.11/site-packages/swift/megatron/train/sft.py", line 90, in megatron_sft_main
[rank0]: return MegatronSft(args).main()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/swift/llm/base.py", line 47, in main
[rank0]: result = self.run()
[rank0]: ^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/swift/megatron/train/sft.py", line 75, in run
[rank0]: pretrain(
[rank0]: File "/mnt/workspace/.cache/modelscope/hub/_github/Megatron-LM/megatron/training/training.py", line 408, in pretrain
[rank0]: iteration, num_floating_point_operations_so_far = train(
[rank0]: ^^^^^^
[rank0]: File "/mnt/workspace/.cache/modelscope/hub/_github/Megatron-LM/megatron/training/training.py", line 1493, in train
[rank0]: train_step(forward_step_func,
[rank0]: File "/mnt/workspace/.cache/modelscope/hub/_github/Megatron-LM/megatron/training/training.py", line 791, in train_step
[rank0]: losses_reduced = forward_backward_func(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/workspace/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 471, in forward_backward_no_pipelining
[rank0]: output_tensor, num_tokens = forward_step(
[rank0]: ^^^^^^^^^^^^^
[rank0]: File "/mnt/workspace/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 275, in forward_step
[rank0]: output_tensor, loss_func = forward_step_func(data_iterator, model)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/swift/megatron/train/utils.py", line 182, in forward_step
[rank0]: output_tensor = model(tokens, position_ids, attention_mask, labels=labels, packed_seq_params=packed_seq_params)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/workspace/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/distributed/data_parallel_base.py", line 22, in forward
[rank0]: return self.module(*inputs, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/workspace/.cache/modelscope/hub/_github/Megatron-LM/megatron/legacy/model/module.py", line 189, in forward
[rank0]: outputs = self.module(*inputs, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/workspace/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/models/gpt/gpt_model.py", line 264, in forward
[rank0]: hidden_states = self.decoder(
[rank0]: ^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/workspace/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/transformer/transformer_block.py", line 549, in forward
[rank0]: hidden_states, context = layer(
[rank0]: ^^^^^^
[rank0]: File "/mnt/workspace/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/transformer/transformer_layer.py", line 502, in __call__
[rank0]: return super(MegatronModule, self).__call__(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/workspace/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/transformer/transformer_layer.py", line 390, in forward
[rank0]: attention_output_with_bias = self.self_attention(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/workspace/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/transformer/attention.py", line 450, in forward
[rank0]: core_attn_out = self._checkpointed_attention_forward(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/workspace/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/transformer/attention.py", line 170, in _checkpointed_attention_forward
[rank0]: hidden_states = tensor_parallel.checkpoint(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/workspace/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/tensor_parallel/random.py", line 431, in checkpoint
[rank0]: return CheckpointFunction.apply(function, distribute_saved_activations, *args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/torch/autograd/function.py", line 575, in apply
[rank0]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/workspace/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/tensor_parallel/random.py", line 369, in forward
[rank0]: outputs = run_function(*args)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/workspace/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/transformer/attention.py", line 156, in custom_forward
[rank0]: output_ = self.core_attention(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/workspace/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/extensions/transformer_engine.py", line 804, in forward
[rank0]: core_attn_out = super().forward(
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/site-packages/transformer_engine/pytorch/attention.py", line 7736, in forward
[rank0]: raise ValueError("No dot product attention support for the provided inputs!")
[rank0]: ValueError: No dot product attention support for the provided inputs!
[rank1]:[W415 15:38:12.811977692 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
[rank3]:[W415 15:38:12.832828834 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0415 15:38:14.751000 78 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 111 closing signal SIGTERM
W0415 15:38:14.753000 78 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 113 closing signal SIGTERM
E0415 15:38:14.792000 78 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 112) of binary: /usr/local/bin/python
Traceback (most recent call last):

@ptrendx
Member
ptrendx commented Apr 25, 2025

Hi @SeekPoint, the part of the log that would help us narrow down the issue actually comes before the traceback - with the NVTE_DEBUG and NVTE_DEBUG_LEVEL environment variables set, there should be lines logging the reasons why no backend was chosen.
