torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 9.18 GiB. GPU 3 has a total capacity · Issue #12 · tue-mps/eomt · GitHub


Open
921162820 opened this issue May 16, 2025 · 4 comments

Comments

@921162820

During validation, I'm running out of VRAM even with a batch_size of 1. My GPU is a 4090. How can I solve this?

@NiccoloCavagnero
Collaborator

Could you provide additional information?
Specifically, which dataset, resolution, and model size are you using?

A quick workaround would be to use a smaller model size on a smaller resolution.
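To see why a smaller resolution helps so much: self-attention memory grows roughly with the square of the token count, and the token count itself grows with the square of the image side. A rough back-of-the-envelope sketch for a patch-14 ViT with 4 register tokens (illustrative arithmetic only, not the repository's code):

```python
# Rough token-count comparison for a patch-14 ViT with register tokens.
# Illustrative only: patch size and register count are taken from the
# model name vit_base_patch14_reg4_dinov2, not from the repo's config.
def num_tokens(img_size: int, patch_size: int = 14, num_registers: int = 4) -> int:
    per_side = img_size // patch_size          # patches along one side
    return per_side * per_side + 1 + num_registers  # patches + CLS + registers

for size in (640, 518, 448):
    n = num_tokens(size)
    # attention score matrix scales ~ n^2 per head per layer
    print(f"{size}px -> {n} tokens, attention ~ {n * n:,} entries")
```

Dropping from 640px to 448px roughly halves the token count and cuts the attention footprint to about a quarter, which is why a smaller resolution is the quickest lever when VRAM is tight.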

Best,
Nick

@921162820
Author

We use the model vit_base_patch14_reg4_dinov2 with img_size: [640, 640] and a batch_size of 1 per GPU. The dataset is a private dataset whose annotations are consistent with the COCO format.

@921162820
Author

[rank1]: Traceback (most recent call last):
[rank1]: File "/root/code/eomt-master/main.py", line 186, in <module>
[rank1]: cli_main()
[rank1]: File "/root/code/eomt-master/main.py", line 164, in cli_main
[rank1]: LightningCLI(
[rank1]: File "/root/code/eomt-master/main.py", line 102, in __init__
[rank1]: super().__init__(*args, **kwargs)
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/lightning/pytorch/cli.py", line 398, in __init__
[rank1]: self._run_subcommand(self.subcommand)
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/lightning/pytorch/cli.py", line 708, in _run_subcommand
[rank1]: fn(**fn_kwargs)
[rank1]: File "/root/code/eomt-master/main.py", line 148, in fit
[rank1]: self.trainer.fit(model, **kwargs)
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 561, in fit
[rank1]: call._call_and_handle_interrupt(
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
[rank1]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]: return function(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 599, in _fit_impl
[rank1]: self._run(model, ckpt_path=ckpt_path)
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 1012, in _run
[rank1]: results = self._run_stage()
[rank1]: ^^^^^^^^^^^^^^^^^
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/lightning/pytorch/trainer/trainer.py", line 1056, in _run_stage
[rank1]: self.fit_loop.run()
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/lightning/pytorch/loops/fit_loop.py", line 216, in run
[rank1]: self.advance()
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/lightning/pytorch/loops/fit_loop.py", line 455, in advance
[rank1]: self.epoch_loop.run(self._data_fetcher)
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 151, in run
[rank1]: self.on_advance_end(data_fetcher)
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 370, in on_advance_end
[rank1]: self.val_loop.run()
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/lightning/pytorch/loops/utilities.py", line 179, in _decorator
[rank1]: return loop_run(self, *args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 145, in run
[rank1]: self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 437, in _evaluation_step
[rank1]: output = call._call_strategy_hook(trainer, hook_name, *step_args)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/lightning/pytorch/trainer/call.py", line 328, in _call_strategy_hook
[rank1]: output = fn(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/lightning/pytorch/strategies/strategy.py", line 411, in validation_step
[rank1]: return self._forward_redirection(self.model, self.lightning_module, "validation_step", *args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/lightning/pytorch/strategies/strategy.py", line 641, in __call__
[rank1]: wrapper_output = wrapper_module(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/torch/nn/parallel/distributed.py", line 1636, in forward
[rank1]: else self._run_ddp_forward(*inputs, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/torch/nn/parallel/distributed.py", line 1454, in _run_ddp_forward
[rank1]: return self.module(*inputs, **kwargs) # type: ignore[index]
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/lightning/pytorch/strategies/strategy.py", line 634, in wrapped_forward
[rank1]: out = method(*_args, **_kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/anaconda3/envs/EoMT/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 433, in _fn
[rank1]: return fn(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/code/eomt-master/training/lightning_module.py", line 185, in validation_step
[rank1]: def validation_step(self, batch, batch_idx=0):
[rank1]: File "/root/code/eomt-master/training/mask_classification_instance.py", line 82, in eval_step
[rank1]: def eval_step(
[rank1]: File "/root/code/eomt-master/training/mask_classification_instance.py", line 91, in torch_dynamo_resume_in_eval_step_at_91
[rank1]: transformed_imgs = self.resize_and_pad_imgs_instance_panoptic(imgs)
[rank1]: File "/root/code/eomt-master/training/mask_classification_instance.py", line 125, in torch_dynamo_resume_in_eval_step_at_92
[rank1]:
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.77 GiB. GPU 1 has a total capacity of 23.69 GiB of which 1.14 GiB is free. Process 3660512 has 22.54 GiB memory in use. Of the allocated memory 13.04 GiB is allocated by PyTorch, and 9.03 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
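Note that 9.03 GiB here is reserved by PyTorch but unallocated, which points at allocator fragmentation rather than pure capacity. A minimal sketch of applying the suggestion from the error message (assuming the variable is set before torch initializes CUDA, i.e. before `import torch`, or in the shell that launches training):

```python
# Illustrative sketch: enable expandable segments in the CUDA caching
# allocator to reduce fragmentation, per the hint in the OOM message.
# This must be set before torch initializes CUDA, so do it before
# `import torch` (or export it in the launching shell instead).
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# import torch  # import torch only after the variable is set
```

This does not reduce the model's actual memory needs; it only lets the allocator reuse reserved-but-unallocated blocks more flexibly, which can be enough when, as here, several GiB sit reserved but unused.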

@NiccoloCavagnero
Collaborator

I can confirm that validating a ViT-B at resolution 640x640 with a batch size of 1 takes less than 3GB of memory for both panoptic and instance inference on COCO.

Can you reproduce the same error using default COCO?

Best,
Nick
