Counterintuitive AttributeError in Birch for very large numbers #17966

Closed
THaar50 opened this issue Jul 21, 2020 · 8 comments · Fixed by #23395
Comments

@THaar50
THaar50 commented Jul 21, 2020

Describe the bug

Input data containing very large numbers causes overflows in the Birch algorithm, which manifest as different errors depending on the branching_factor parameter. If the number of data points is smaller than or equal to the branching factor, a ValueError is raised in AgglomerativeClustering, but if this number exceeds the branching factor, an AttributeError is raised instead. Since both errors are caused by the input data, I would expect to get a ValueError in both cases.

Steps/Code to Reproduce

Running the same code with fewer data points raises a ValueError; otherwise an AttributeError is raised.
Example:

from sklearn.cluster import Birch

# All entries are finite, but close to the float64 maximum (~1.8e308)
X = [[1.30830774e+307, 6.02217328e+307],
     [1.54166067e+308, 1.75812744e+308],
     [5.57938866e+307, 4.13840113e+307],
     [1.36302835e+308, 1.07968131e+308],
     [1.58772669e+308, 1.19380571e+307],
     [2.20362426e+307, 1.58814671e+308],
     [1.06216028e+308, 1.14258583e+308],
     [7.18031911e+307, 1.69661213e+308],
     [7.91182553e+307, 5.12892426e+307],
     [5.58470885e+307, 9.13566765e+306],
     [1.22366243e+308, 8.29427922e+307]]

clusterer = Birch(branching_factor=10)
clusterer.fit(X)
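For what it's worth, the overflow can be seen with numpy alone (a minimal sketch, independent of scikit-learn, using two of the rows from X above): every entry is finite and passes a finiteness check, but squaring the values, as a squared-norm computation would, immediately overflows:

```python
import numpy as np

# Finite values within a factor of ~10 of np.finfo(np.float64).max (~1.8e308)
X = np.array([[1.30830774e+307, 6.02217328e+307],
              [1.54166067e+308, 1.75812744e+308]])

print(np.isfinite(X).all())    # True: the raw input passes validation
with np.errstate(over="ignore"):
    squared = np.square(X)     # every square exceeds float64 max -> inf
print(np.isinf(squared).all())  # True
```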

Expected Results

A ValueError that specifies the range of allowed values like in other clustering algorithms:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Or a similar error like the ValueError from the case where data points are smaller than or equal to the branching factor:

ValueError: The condensed distance matrix must contain only finite values.

Actual Results

C:\Program Files\Python37\lib\site-packages\numpy\core\fromnumeric.py:90: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py:189: RuntimeWarning: invalid value encountered in add
  dist_matrix += self.squared_norm_
C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py:304: RuntimeWarning: overflow encountered in add
  new_ls = self.linear_sum_ + nominee_cluster.linear_sum_
C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py:309: RuntimeWarning: invalid value encountered in double_scalars
  sq_radius = (new_ss + dot_product) / new_n + new_norm
C:\Program Files\Python37\lib\site-packages\numpy\core\fromnumeric.py:90: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
C:\Program Files\Python37\lib\site-packages\sklearn\utils\extmath.py:153: RuntimeWarning: overflow encountered in matmul
  ret = a @ b
C:\Program Files\Python37\lib\site-packages\sklearn\metrics\pairwise.py:310: RuntimeWarning: invalid value encountered in add
  distances += XX
C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py:81: RuntimeWarning: invalid value encountered in less
  node1_closer = node1_dist < node2_dist
C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py:294: RuntimeWarning: overflow encountered in add
  self.linear_sum_ += subcluster.linear_sum_
Traceback (most recent call last):
  File "C:\Users\thaar\PycharmProjects\sklearn-dev\birch_test.py", line 61, in <module>
    clusterer.fit(X)
  File "C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py", line 463, in fit
    return self._fit(X)
  File "C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py", line 510, in _fit
    self.root_.append_subcluster(new_subcluster1)
  File "C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py", line 158, in append_subcluster
    self.init_sq_norm_[n_samples] = subcluster.sq_norm_
AttributeError: '_CFSubcluster' object has no attribute 'sq_norm_'

Versions

System:
python: 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\thaar\PycharmProjects\sklearn-dev\venv\Scripts\python.exe
machine: Windows-10-10.0.18362-SP0

Python dependencies:
pip: 20.1.1
setuptools: 49.2.0
sklearn: 0.23.1
numpy: 1.18.4
scipy: 1.4.1
Cython: None
pandas: 1.0.5
matplotlib: 3.2.1
joblib: 0.14.1
threadpoolctl: 2.0.0

Built with OpenMP: True

@thomasjpfan
Member

X has values that are just barely under np.finfo(np.float64).max, so it passes through check_array, but the calculations Birch then performs on these values overflow past that maximum.

One way to try to catch this is to intercept the RuntimeWarning and raise a more informative error. I am -0.5 on this, because if we go down this route it would make sense to catch it everywhere in the library. I think the RuntimeWarning is good enough to show what is going wrong.
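A sketch of that idea (the `checked_sum` helper is hypothetical, not scikit-learn API): escalate numpy's overflow RuntimeWarning to an exception and re-raise it as a ValueError with a clearer message:

```python
import warnings
import numpy as np

def checked_sum(values):
    """Sum values, escalating numpy overflow warnings to a ValueError.

    Hypothetical helper illustrating the approach discussed above.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("error", RuntimeWarning)
        try:
            return np.add.reduce(values)
        except RuntimeWarning as exc:
            raise ValueError(
                "Input contains values too large for dtype('float64')"
            ) from exc

# Finite inputs whose sum overflows float64
x = np.array([1.5e308, 1.6e308])
try:
    checked_sum(x)
except ValueError as e:
    print(e)  # the informative message instead of a bare warning
```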

@rth
Member
rth commented Jul 21, 2020

The underlying issue to that error might be similar to #6172

Related discussion on behavior with very large floats in #17925

@thomasjpfan
Member

Running the snippet on main gives:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

@rth
Member
rth commented Feb 11, 2021

Yes, that's why it would be nice to have a minimal reproducible example that doesn't use huge floats, as it does look like there is a code path that triggers the above AttributeError.

@sherbold

Is the better error message a result of #18727? That at least resolves this issue. I will try to come up with a way to trigger the exception for smaller data, to possibly help with #6172.

@glemaitre glemaitre added Bug and removed Bug: triage labels Dec 22, 2021
@glemaitre
Member
glemaitre commented Dec 22, 2021

Labelling as a bug, but we should get a minimal reproducible example to be able to reproduce the error on a Unix machine.

@nlahaye
nlahaye commented Apr 29, 2022

I have run into this issue as well with much smaller numbers, using this data and running this code:

import dask.array as da
from sklearn.cluster import Birch
from sklearn.preprocessing import StandardScaler

data = da.from_zarr("bug_data.zarr")
print("Data Shape: ", data.shape)
print("Min, Max, Mean, StDev.: ", data.min().compute(), data.max().compute(), data.mean().compute(), data.std().compute())
scaler = StandardScaler()
scaler.fit(data)
data = scaler.transform(data)
print("Post-Scale - Min, Max, Mean, StDev.: ", data.min(), data.max(), data.mean(), data.std())
clustering = Birch(branching_factor=5, threshold=1e-5, n_clusters=None)
clustering.fit(data)

I run into this error:

Data Shape: (150000, 2000)
Min, Max, Mean, StDev.: -1.7028557 5.1015463 0.020574544 0.32617828
/home/nlahaye/.local/lib/python3.8/site-packages/dask/array/core.py:1650: FutureWarning: The numpy.may_share_memory function is not implemented by Dask array. You may want to use the da.map_blocks function or something similar to silence this warning. Your code may stop working in a future release.
warnings.warn(
Post-Scale - Min, Max, Mean, StDev.: -5.8093686 7.8372993 4.7429404e-11 1.0000027
Traceback (most recent call last):
  File "clustering_bug.py", line 25, in <module>
    clustering.fit(data)
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 517, in fit
    return self._fit(X, partial=False)
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 562, in _fit
    split = self.root_.insert_cf_subcluster(subcluster)
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 200, in insert_cf_subcluster
    split_child = closest_subcluster.child_.insert_cf_subcluster(subcluster)
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 200, in insert_cf_subcluster
    split_child = closest_subcluster.child_.insert_cf_subcluster(subcluster)
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 200, in insert_cf_subcluster
    split_child = closest_subcluster.child_.insert_cf_subcluster(subcluster)
  [Previous line repeated 3 more times]
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 221, in insert_cf_subcluster
    self.update_split_subclusters(
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 179, in update_split_subclusters
    self.init_sq_norm_[ind] = new_subcluster1.sq_norm_
AttributeError: '_CFSubcluster' object has no attribute 'sq_norm_'

For simplicity, I extracted this code from the software I use for clustering and stripped away the dask-ml wrappers; I have been able to successfully complete jobs with other datasets using it. This data is also a reduced set from a dataset with many more samples.

Environment:
OS - CentOS-7
python - v3.8.2
dask - v2022.04.1
sklearn - v1.0.2

Please let me know if there is any other info you would like, etc.

Thanks!
Nick

@nlahaye
nlahaye commented May 25, 2022

Thank you both for your help!!
