[BUG] KNeighborsTimeSeriesClassifier throws an OOM error when it should not · Issue #5914 · sktime/sktime

Closed
srggrs opened this issue Feb 9, 2024 · 14 comments · Fixed by #5937, #5939 or #5952
Labels
bug (Something isn't working), module:classification (classification module: time series classification)

Comments

srggrs commented Feb 9, 2024

Describe the bug
Fitting a standard KNeighborsTimeSeriesClassifier to a dataset of 250k samples throws an OOM error, even though the memory footprint of the training data is small (on the order of MBs). The error is:

File <command-3793510480034976>, line 2
      1 model = KNeighborsTimeSeriesClassifier(n_neighbors=3, distance="euclidean")
----> 2 model.fit(X, y)

File .../lib/python3.10/site-packages/sktime/classification/base.py:238, in BaseClassifier.fit(self, X, y)
    233         raise AttributeError(
    234             "self.n_jobs must be set if capability:multithreading is True"
    235         )
    237 # pass coerced and checked data to inner _fit
--> 238 self._fit(X, y)
    239 self.fit_time_ = int(round(time.time() * 1000)) - start
    241 # this should happen last: fitted state is set to True

File .../lib/python3.10/site-packages/sktime/classification/distance_based/_time_series_neighbors.py:241, in KNeighborsTimeSeriesClassifier._fit(self, X, y)
    237     _, _, X_meta = check_is_mtype(
    238         X, X_inner_mtype, return_metadata=True, msg_return_dict="list"
    239     )
    240     n = X_meta["n_instances"]
--> 241     dist_mat = np.zeros([n, n], dtype="float")
    243 self.knn_estimator_.fit(dist_mat, y)
    245 return self

MemoryError: Unable to allocate 466. GiB for an array with shape (250000, 250000) and data type float64

To Reproduce
Set up a Python >= 3.8 environment, install sktime, and run:

import numpy as np
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier

X = np.random.rand(250_000, 1, 24)
y = np.random.choice([0, 1], (250_000, 1))

# it should be ~ 45MB
print(f"size of X {X.size * X.itemsize / 1024 ** 2} MB")

model = KNeighborsTimeSeriesClassifier(n_neighbors=3, distance="euclidean")
model.fit(X, y)

Expected behavior
It should train without throwing a memory error

Additional context
I tried other Python libraries that offer time series clustering with the same algorithm (kNN), and there is no problem fitting them on this dataset.

Versions
latest

System:
    python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
executable: /..../bin/python
   machine: Linux-5.15.0-1052-aws-x86_64-with-glibc2.35

Python dependencies:
          pip: 22.2.2
       sktime: 0.26.0
      sklearn: 1.1.1
       skbase: 0.7.2
        numpy: 1.21.5
        scipy: 1.9.1
       pandas: 1.4.4
   matplotlib: 3.5.2
       joblib: 1.2.0
        numba: None
  statsmodels: 0.13.2
     pmdarima: None
statsforecast: None
      tsfresh: None
      tslearn: None
        torch: None
   tensorflow: None
tensorflow_probability: None
srggrs added the bug label Feb 9, 2024
fkiraly (Collaborator) commented Feb 12, 2024

I think this is due to the distance matrix computed internally - it is of size 250,000 × 250,000. Internally, sklearn is called with a precomputed metric.
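(For illustration, not part of the original comment: the failed allocation in the traceback can be checked with simple arithmetic.)

# Back-of-the-envelope check of the allocation that fails in the traceback above:
# a dense (250_000, 250_000) float64 distance matrix.
n = 250_000
bytes_needed = n * n * 8  # 8 bytes per float64 entry
print(f"{bytes_needed / 1024 ** 3:.0f} GiB")  # ~466 GiB, matching the MemoryError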

fkiraly added the module:classification label Feb 12, 2024
fkiraly (Collaborator) commented Feb 12, 2024

To reduce the size of the matrix, you could try BaggingClassifier with a smaller n_samples (see the sketch below).
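(A minimal sketch of this suggestion, not from the original comment; it assumes sktime's BaggingClassifier exposes an n_samples fraction as referenced above - check the installed version for exact parameter names.)

# Sketch only: bag several kNN fits on subsamples so each internal distance
# matrix stays small. Import path and n_samples follow the suggestion above;
# verify against the installed sktime version.
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier
from sktime.classification.ensemble import BaggingClassifier

base = KNeighborsTimeSeriesClassifier(n_neighbors=3, distance="euclidean")
# n_samples=0.01 -> each estimator sees ~2_500 of the 250_000 series, so the
# per-estimator distance matrix is ~2_500 x 2_500 (~48 MB) instead of ~466 GiB.
clf = BaggingClassifier(base, n_estimators=10, n_samples=0.01)
# clf.fit(X, y)  # X, y as in the repro above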

You say you tried other libraries where you do not get OOM - which ones, if I may ask? We could simply interface these.

srggrs (Author) commented Feb 13, 2024

Thanks for the explanation and suggestion. I tried aeon's TimeSeriesKMeans, but I can see they do not use sklearn kNN under the hood; they simply do a distance computation with respect to the training data...

fkiraly (Collaborator) commented Feb 13, 2024

But that's a different algorithm - k-means, not kNN.

Anyway, I fixed up a version where a callable is passed to sklearn instead of a distance matrix, so different algorithms should be available:
#5937
(this should trade memory use against compute)
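(For context, a rough sketch of the trade-off described above - this is not the #5937 implementation: with sklearn, a precomputed metric needs the full (n, n) matrix up front, whereas a callable metric computes distances on demand.)

# Illustrative only - not the code in #5937. A callable metric lets sklearn's
# brute-force kNN compute pairwise distances on demand instead of receiving a
# dense precomputed matrix.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def euclidean_flat(a, b):
    # sklearn passes flattened 1D rows; each row here is a flattened (1, 24) series
    return np.sqrt(np.sum((a - b) ** 2))

X = np.random.rand(1_000, 1, 24)
y = np.random.choice([0, 1], 1_000)

knn = KNeighborsClassifier(n_neighbors=3, metric=euclidean_flat, algorithm="brute")
knn.fit(X.reshape(len(X), -1), y)  # no dense (n, n) float64 matrix is allocated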

What would be appreciated is testing - would you be able to check whether #5937 is less memory hungry? It works only for str metrics currently. If yes, I'd extend it and we could add it as another option.

fkiraly (Collaborator) commented Feb 13, 2024

Also, have you tried pyts k-neighbors?
https://pyts.readthedocs.io/en/latest/generated/pyts.classification.KNeighborsClassifier.html
(the aeon one looks a bit derivative of this)

We've also been trying to add interfaces to pyts to integrate the ecosystem better.

Again, testing & report back would be much appreciated!
E.g., if the pyts one is Pareto-better, we could replace the sktime one with it (by rename-switch - we typically don't delete).
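(A minimal sketch of trying pyts directly, not from the original comment; parameters follow the linked pyts docs, so verify against the installed pyts version.)

# Sketch only: pyts's KNeighborsClassifier builds on sklearn's neighbors and
# expects 2D input (n_samples, n_timestamps), so the univariate panel is flattened.
import numpy as np
from pyts.classification import KNeighborsClassifier as PytsKNN

X = np.random.rand(250_000, 1, 24)
y = np.random.choice([0, 1], 250_000)

clf = PytsKNN(n_neighbors=3, metric="euclidean")
clf.fit(X.reshape(len(X), -1), y)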

fkiraly (Collaborator) commented Feb 13, 2024

Also for testing, here is an interface to the pyts class: #5939

Having these all in the sktime interface should make it easier to compare, feedback appreciated.

(I already see that it has a different set of distances to choose from...)

srggrs (Author) commented Feb 13, 2024

Hmm, interesting! Thank you for posting that! I was digging into the sklearn KNeighborsClassifier and found that you can potentially pass something like a neighbor graph to avoid the OOM and the massive matrix. Perhaps KNeighborsTransformer, which uses kneighbors_graph under the hood, would work. So when the number of training samples exceeds a threshold, it could switch to using that?

srggrs (Author) commented Feb 13, 2024

I will test that PR asap

fkiraly (Collaborator) commented Feb 13, 2024

Thanks!

Btw, there's a runtime profiler in sktime, in utils.profiling, profile_classifier.

Given that your problem is memory - do you know of a quick way to profile memory? This could be added as a feature; currently the profile_classifier function is blind to that, afaik.
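(Not part of sktime's utils.profiling - just a minimal sketch of one quick standard-library option for the memory question above.)

# Sketch: measure peak Python-level allocations around a fit call with
# tracemalloc from the standard library.
import tracemalloc

tracemalloc.start()
model.fit(X, y)  # the call being profiled, e.g. the repro above
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak memory during fit: {peak / 1024 ** 2:.1f} MB")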

fkiraly (Collaborator) commented Feb 13, 2024

Perhaps KNeighborsTransformer, which uses kneighbors_graph under the hood, would work. So when the number of training samples exceeds a threshold, it could switch to using that?

Hm, interesting idea. It does not work with a precomputed matrix, and needs a callable like the one we pass in #5937.
Can you describe the algorithm you are suggesting? Am I guessing right that you first want to use the transformer, then pass a sparse matrix to sklearn kNN?

That might work, but as said it would require passing a callable in the first place, and doing that could fix your problem without resorting to KNeighborsTransformer, as in #5937.
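(A rough sketch of that guessed pipeline, not from the original comment, using sklearn's documented pattern of chaining KNeighborsTransformer into a precomputed-metric KNeighborsClassifier; sizes and parameters are illustrative only.)

# Sketch only: the transformer emits a sparse kneighbors_graph, so the
# "precomputed" classifier never sees a dense (n, n) matrix.
import numpy as np
from sklearn.neighbors import KNeighborsTransformer, KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X = np.random.rand(10_000, 24)  # flattened univariate series
y = np.random.choice([0, 1], 10_000)

pipe = make_pipeline(
    # one extra neighbor, since the transformer counts each point as its own neighbor
    KNeighborsTransformer(n_neighbors=4, mode="distance"),
    KNeighborsClassifier(n_neighbors=3, metric="precomputed"),
)
pipe.fit(X, y)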

fkiraly (Collaborator) commented Feb 16, 2024

I will test that PR asap

Did you manage to check?

fkiraly (Collaborator) commented Feb 16, 2024

I got ball_tree working via adaptation, but kd_tree does not work, as sklearn's KNeighborsClassifier with kd_tree accepts neither precomputed nor callable distances.

fkiraly (Collaborator) commented Feb 16, 2024

Here's the one from tslearn: #5952

So, in the end you could try any of the following:

  • the sktime native classifier with the callable-based option from #5937
  • the pyts adapter from #5939
  • the tslearn adapter from #5952

Some questions:

  • can you explain your proposed approach using KNeighborsTransformer?
  • do you want to use the classifier with custom distances? If so, I can write some adapter code for the pyts or tslearn adapters to allow use of sktime distances.

As already mentioned, testing would be appreciated.

fkiraly added a commit that referenced this issue Feb 18, 2024
Adapter for `pyts.classification.KNeighborsClassifier`, using the
generic adapter introduced in #5851.
Serves as a test case for classifiers, and possibly fixes
#5914.

Depends on #5851 for the adapter.
fkiraly added a commit that referenced this issue Feb 24, 2024
Adapter for `tslearn.neighbors.KNeighborsTimeSeriesClassifier`, using the generic adapter introduced in #4992.
Serves as a test case for classifiers, and possibly fixes
#5914.
fkiraly (Collaborator) commented Feb 28, 2024

@srggrs, let us know if any of the three KNN classifiers released with 0.26.1 solves the issue (pyts, tslearn, and the new parameter to the sktime native one).
