[BUG] Low GPU Utilization · Issue #8278 · sktime/sktime

Open
jobs-git opened this issue May 22, 2025 · 5 comments
Labels
bug Something isn't working module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting

Comments

@jobs-git (Contributor) commented May 22, 2025

Describe the bug
GPU utilization is low during training: the GPU sits idle much of the time, which lengthens the overall training time.

Related: sktime/pytorch-forecasting#1426
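
For context, one minimal way to observe the idle pattern described above is to poll GPU utilization in a background thread while fit() runs. The sketch below is not part of the original report; it assumes an NVIDIA GPU with pynvml installed (torch.cuda.utilization() depends on it).

import threading
import time

import torch

util_samples = []

def _poll_gpu(stop_event, interval=0.5):
    # record GPU utilization (percent) every `interval` seconds
    while not stop_event.is_set():
        util_samples.append(torch.cuda.utilization())
        time.sleep(interval)

stop = threading.Event()
poller = threading.Thread(target=_poll_gpu, args=(stop,), daemon=True)
poller.start()

# ... run model.fit(y=y, fh=fh) from the snippet below here ...

stop.set()
poller.join()
print(f"mean GPU utilization: {sum(util_samples) / max(len(util_samples), 1):.1f}%")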

To Reproduce

# import packages
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.pytorchforecasting import PytorchForecastingNHiTS
from sktime.utils._testing.hierarchical import _make_hierarchical

# generate random hierarchical data: 128 * 512 series, 128 timepoints each
data = _make_hierarchical(
    hierarchy_levels=(128, 512), max_timepoints=128, min_timepoints=128, n_columns=1
)

max_prediction_length = 5
fh = ForecastingHorizon(range(1, max_prediction_length + 1), is_relative=True)

y = data["c0"].to_frame()

model = PytorchForecastingNHiTS(
    model_params={
        "activation": "ReLU",
        "hidden_size": 512,
        "n_blocks": [1, 1, 1],
        "n_layers": 2,
    },
    trainer_params={
        "max_epochs": 5,
        "limit_train_batches": 1,
    },
    train_to_dataloader_params={"batch_size": 1024 * 16, "num_workers": 2},
    validation_to_dataloader_params={"batch_size": 1024 * 16, "num_workers": 2},
)

model.fit(y=y, fh=fh)
y_pred = model.predict(fh, y=y)
print(y_pred)

Expected behavior
The GPU should be busy for most of the run, with only occasional idle periods.

Additional context

Versions

machine: Linux-6.8.0-53-generic-x86_64-with-glibc2.39

Python dependencies:
          pip: 25.0.1
       sktime: 0.37.0
      sklearn: 1.5.2
       skbase: 0.8.3
        numpy: 1.26.4
        scipy: 1.15.2
       pandas: 2.2.2
   matplotlib: 3.10.1
       joblib: 1.4.2
        numba: 0.61.0
  statsmodels: 0.14.4
     pmdarima: 2.0.4
statsforecast: 1.7.8
      tsfresh: 0.21.0
      tslearn: 0.6.3
        torch: 2.7.0
   tensorflow: 2.19.0
@jobs-git jobs-git added the bug Something isn't working label May 22, 2025
@github-project-automation github-project-automation bot moved this to Needs triage & validation in Bugfixing May 22, 2025
@fkiraly fkiraly added the module:forecasting forecasting module: forecasting, incl probabilistic and hierarchical forecasting label May 22, 2025
@fkiraly (Collaborator) commented May 22, 2025

FYI @phoeenniixx, @PranavBhatP, @agobbifbk - any ideas?

@jobs-git jobs-git changed the title [BUG] Low GPU utilization [BUG] Slow GPU calculation May 23, 2025
@jobs-git jobs-git changed the title [BUG] Slow GPU calculation [BUG] Low GPU Utilization May 23, 2025
@Green-Kedia commented

@jobs-git May I ask why limit_train_batches is set to 1? Maybe that is what causes the model to process just one batch per epoch. I suggest either removing it or setting it to the float 1.0.
Thanks
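
For reference, limit_train_batches is passed through to the Lightning Trainer, where an int means an absolute number of batches and a float means a fraction of the training batches. A sketch of the suggested change (this reflects Lightning's documented semantics, not anything sktime-specific):

# int   -> run exactly that many training batches per epoch (1 == a single batch)
# float -> run that fraction of the training batches per epoch (1.0 == all of them)
trainer_params = {
    "max_epochs": 5,
    "limit_train_batches": 1.0,  # use all batches
}

# or simply omit the key and keep Lightning's default of 1.0
trainer_params = {"max_epochs": 5}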

@jobs-git (Contributor, Author) commented May 24, 2025

@jobs-git May I ask why limit_train_batches is set to 1? Maybe that is what causes the model to process just one batch per epoch. I suggest either removing it or setting it to the float 1.0. Thanks

Interesting note, but then I am not sure why setting it lower makes each epoch complete faster and the overall training finish faster as well.

@agobbifbk commented

Hard to say. Such a large batch size can also cause bottlenecks (transferring data between CPU and GPU), as can the number of workers (try the defaults).
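
As an illustration of that suggestion, the same estimator with the dataloader overrides dropped, so that the default batch size and worker count are used (only a sketch; model parameters taken from the original snippet):

model = PytorchForecastingNHiTS(
    model_params={
        "activation": "ReLU",
        "hidden_size": 512,
        "n_blocks": [1, 1, 1],
        "n_layers": 2,
    },
    trainer_params={"max_epochs": 5},
    # train_to_dataloader_params / validation_to_dataloader_params left unset,
    # so default batch_size and num_workers are used
)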

@jobs-git (Contributor, Author) commented May 26, 2025

Hard to say. Such a large batch size can also cause bottlenecks (transferring data between CPU and GPU), as can the number of workers (try the defaults).

I was able to trace it to the torch DataLoader; I raised an issue here:

pytorch/pytorch#154318

Manual batching is faster, but the DataLoader is just too slow. Is there any way to bypass the DataLoader and use manual batching?
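
To illustrate the overhead being described, here is a self-contained timing comparison on a plain tensor (not the sktime pipeline; sizes are illustrative): the DataLoader pays per-sample indexing and collation costs, while manual slicing takes one view per batch.

import time

import torch
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(65536, 128)  # illustrative: 128 * 512 series, 128 timepoints
dataset = TensorDataset(X)
batch_size = 1024 * 16

# DataLoader path: per-sample __getitem__ plus default collation
loader = DataLoader(dataset, batch_size=batch_size, num_workers=2)
t0 = time.perf_counter()
for (batch,) in loader:
    pass
print("DataLoader:     ", time.perf_counter() - t0)

# Manual batching: one contiguous slice per batch, no collation
t0 = time.perf_counter()
for i in range(0, len(X), batch_size):
    batch = X[i : i + batch_size]
print("manual slicing: ", time.perf_counter() - t0)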
