[BUG] Low GPU Utilization · Issue #8278 · sktime/sktime

Open
jobs-git opened this issue May 22, 2025 · 5 comments
Labels
bug Something isn't working module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting

Comments

@jobs-git (Contributor) commented May 22, 2025

Describe the bug
GPU utilization is low during training: the GPU sits idle much of the time, which lengthens the overall training time.

Related: sktime/pytorch-forecasting#1426
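
For context, one minimal way to observe the idle pattern described above is to poll GPU utilization in a background thread while fit() runs. The sketch below is not part of the original report; it assumes an NVIDIA GPU with pynvml installed (torch.cuda.utilization() depends on it).

import threading
import time

import torch

util_samples = []

def _poll_gpu(stop_event, interval=0.5):
    # record GPU utilization (percent) every `interval` seconds
    while not stop_event.is_set():
        util_samples.append(torch.cuda.utilization())
        time.sleep(interval)

stop = threading.Event()
poller = threading.Thread(target=_poll_gpu, args=(stop,), daemon=True)
poller.start()

# ... run model.fit(y=y, fh=fh) from the snippet below here ...

stop.set()
poller.join()
print(f"mean GPU utilization: {sum(util_samples) / max(len(util_samples), 1):.1f}%")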

To Reproduce

# import packages
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.pytorchforecasting import PytorchForecastingNHiTS
from sktime.utils._testing.hierarchical import _make_hierarchical

# generate random hierarchical data: 128 * 512 series, 128 timepoints each
data = _make_hierarchical(
    hierarchy_levels=(128, 512), max_timepoints=128, min_timepoints=128, n_columns=1
)

max_prediction_length = 5
fh = ForecastingHorizon(range(1, max_prediction_length + 1), is_relative=True)

y = data["c0"].to_frame()

model = PytorchForecastingNHiTS(
    model_params={
        "activation": "ReLU",
        "hidden_size": 512,
        "n_blocks": [1, 1, 1],
        "n_layers": 2,
    },
    trainer_params={
        "max_epochs": 5,
        "limit_train_batches": 1,
    },
    train_to_dataloader_params={"batch_size": 1024 * 16, "num_workers": 2},
    validation_to_dataloader_params={"batch_size": 1024 * 16, "num_workers": 2},
)

model.fit(y=y, fh=fh)
y_pred = model.predict(fh, y=y)
print(y_pred)

Expected behavior
The GPU should be busy for most of the run, with only occasional idle periods.

Additional context

Versions

machine: Linux-6.8.0-53-generic-x86_64-with-glibc2.39

Python dependencies:
          pip: 25.0.1
       sktime: 0.37.0
      sklearn: 1.5.2
       skbase: 0.8.3
        numpy: 1.26.4
        scipy: 1.15.2
       pandas: 2.2.2
   matplotlib: 3.10.1
       joblib: 1.4.2
        numba: 0.61.0
  statsmodels: 0.14.4
     pmdarima: 2.0.4
statsforecast: 1.7.8
      tsfresh: 0.21.0
      tslearn: 0.6.3
        torch: 2.7.0
   tensorflow: 2.19.0
@jobs-git jobs-git added the bug Something isn't working label May 22, 2025
@github-project-automation github-project-automation bot moved this to Needs triage & validation in Bugfixing May 22, 2025
@fkiraly fkiraly added the module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting label May 22, 2025
@fkiraly (Collaborator) commented May 22, 2025

FYI @phoeenniixx, @PranavBhatP, @agobbifbk - any ideas?

@jobs-git jobs-git changed the title [BUG] Low GPU utilization [BUG] Slow GPU calculation May 23, 2025
@jobs-git jobs-git changed the title [BUG] Slow GPU calculation [BUG] Low GPU Utilization May 23, 2025
@Green-Kedia commented

@jobs-git May I ask why limit_train_batches is set to 1? Maybe that is what causes the model to process just one batch per epoch. I suggest either removing it or setting it to the float 1.0.
Thanks
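
For reference, limit_train_batches is passed through to the Lightning Trainer, where an int means an absolute number of batches and a float means a fraction of the training batches. A sketch of the suggested change (this reflects Lightning's documented semantics, not anything sktime-specific):

# int   -> run exactly that many training batches per epoch (1 == a single batch)
# float -> run that fraction of the training batches per epoch (1.0 == all of them)
trainer_params = {
    "max_epochs": 5,
    "limit_train_batches": 1.0,  # use all batches
}

# or simply omit the key and keep Lightning's default of 1.0
trainer_params = {"max_epochs": 5}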

@jobs-git (Contributor, Author) commented May 24, 2025

@jobs-git May I ask why limit_train_batches is set to 1? Maybe that is what causes the model to process just one batch per epoch. I suggest either removing it or setting it to the float 1.0. Thanks

Interesting note, but then I am not sure why setting it lower makes each epoch complete faster and the overall training finish faster as well.

@agobbifbk commented

Hard to say. Such a large batch size can also cause bottlenecks (transferring data between CPU and GPU), as can the number of workers (try the defaults).
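
As an illustration of that suggestion, the same estimator with the dataloader overrides dropped, so that the default batch size and worker count are used (only a sketch; model parameters taken from the original snippet):

model = PytorchForecastingNHiTS(
    model_params={
        "activation": "ReLU",
        "hidden_size": 512,
        "n_blocks": [1, 1, 1],
        "n_layers": 2,
    },
    trainer_params={"max_epochs": 5},
    # train_to_dataloader_params / validation_to_dataloader_params left unset,
    # so default batch_size and num_workers are used
)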

@jobs-git (Contributor, Author) commented May 26, 2025

Hard to say. Such a large batch size can also cause bottlenecks (transferring data between CPU and GPU), as can the number of workers (try the defaults).

I was able to trace it to the torch DataLoader; I raised an issue here:

pytorch/pytorch#154318

Manual batching is faster, but the DataLoader is just too slow. Is there any way to bypass the DataLoader and use manual batching?
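
To illustrate the overhead being described, here is a self-contained timing comparison on a plain tensor (not the sktime pipeline; sizes are illustrative): the DataLoader pays per-sample indexing and collation costs, while manual slicing takes one view per batch.

import time

import torch
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(65536, 128)  # illustrative: 128 * 512 series, 128 timepoints
dataset = TensorDataset(X)
batch_size = 1024 * 16

# DataLoader path: per-sample __getitem__ plus default collation
loader = DataLoader(dataset, batch_size=batch_size, num_workers=2)
t0 = time.perf_counter()
for (batch,) in loader:
    pass
print("DataLoader:     ", time.perf_counter() - t0)

# Manual batching: one contiguous slice per batch, no collation
t0 = time.perf_counter()
for i in range(0, len(X), batch_size):
    batch = X[i : i + batch_size]
print("manual slicing: ", time.perf_counter() - t0)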
