Counterintuitive AttributeError in Birch for very large numbers #17966

Closed
THaar50 opened this issue Jul 21, 2020 · 8 comments · Fixed by #23395
Comments

@THaar50
THaar50 commented Jul 21, 2020

Describe the bug

Input data containing very large numbers causes overflows in the Birch algorithm, which manifest as different errors depending on the branching_factor parameter. If the number of data points is smaller than or equal to the branching factor, a ValueError is raised in AgglomerativeClustering, but if this number exceeds the branching factor, an AttributeError is raised instead. Since both errors are caused by the input data, I would expect to get a ValueError in both cases.

Steps/Code to Reproduce

Running the same code with fewer data points raises a ValueError; otherwise an AttributeError is raised.
Example:

from sklearn.cluster import Birch

# All entries are finite, but close to the float64 maximum (~1.8e308)
X = [[1.30830774e+307, 6.02217328e+307],
     [1.54166067e+308, 1.75812744e+308],
     [5.57938866e+307, 4.13840113e+307],
     [1.36302835e+308, 1.07968131e+308],
     [1.58772669e+308, 1.19380571e+307],
     [2.20362426e+307, 1.58814671e+308],
     [1.06216028e+308, 1.14258583e+308],
     [7.18031911e+307, 1.69661213e+308],
     [7.91182553e+307, 5.12892426e+307],
     [5.58470885e+307, 9.13566765e+306],
     [1.22366243e+308, 8.29427922e+307]]

clusterer = Birch(branching_factor=10)
clusterer.fit(X)
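For what it's worth, the overflow can be seen with numpy alone (a minimal sketch, independent of scikit-learn, using two of the rows from X above): every entry is finite and passes a finiteness check, but squaring the values, as a squared-norm computation would, immediately overflows:

```python
import numpy as np

# Finite values within a factor of ~10 of np.finfo(np.float64).max (~1.8e308)
X = np.array([[1.30830774e+307, 6.02217328e+307],
              [1.54166067e+308, 1.75812744e+308]])

print(np.isfinite(X).all())    # True: the raw input passes validation
with np.errstate(over="ignore"):
    squared = np.square(X)     # every square exceeds float64 max -> inf
print(np.isinf(squared).all())  # True
```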

Expected Results

A ValueError that specifies the range of allowed values like in other clustering algorithms:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Or a similar error like the ValueError from the case where data points are smaller than or equal to the branching factor:

ValueError: The condensed distance matrix must contain only finite values.

Actual Results

C:\Program Files\Python37\lib\site-packages\numpy\core\fromnumeric.py:90: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py:189: RuntimeWarning: invalid value encountered in add
  dist_matrix += self.squared_norm_
C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py:304: RuntimeWarning: overflow encountered in add
  new_ls = self.linear_sum_ + nominee_cluster.linear_sum_
C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py:309: RuntimeWarning: invalid value encountered in double_scalars
  sq_radius = (new_ss + dot_product) / new_n + new_norm
C:\Program Files\Python37\lib\site-packages\numpy\core\fromnumeric.py:90: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
C:\Program Files\Python37\lib\site-packages\sklearn\utils\extmath.py:153: RuntimeWarning: overflow encountered in matmul
  ret = a @ b
C:\Program Files\Python37\lib\site-packages\sklearn\metrics\pairwise.py:310: RuntimeWarning: invalid value encountered in add
  distances += XX
C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py:81: RuntimeWarning: invalid value encountered in less
  node1_closer = node1_dist < node2_dist
C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py:294: RuntimeWarning: overflow encountered in add
  self.linear_sum_ += subcluster.linear_sum_
Traceback (most recent call last):
  File "C:\Users\thaar\PycharmProjects\sklearn-dev\birch_test.py", line 61, in <module>
    clusterer.fit(X)
  File "C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py", line 463, in fit
    return self._fit(X)
  File "C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py", line 510, in _fit
    self.root_.append_subcluster(new_subcluster1)
  File "C:\Program Files\Python37\lib\site-packages\sklearn\cluster\_birch.py", line 158, in append_subcluster
    self.init_sq_norm_[n_samples] = subcluster.sq_norm_
AttributeError: '_CFSubcluster' object has no attribute 'sq_norm_'

Versions

System:
python: 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\thaar\PycharmProjects\sklearn-dev\venv\Scripts\python.exe
machine: Windows-10-10.0.18362-SP0

Python dependencies:
pip: 20.1.1
setuptools: 49.2.0
sklearn: 0.23.1
numpy: 1.18.4
scipy: 1.4.1
Cython: None
pandas: 1.0.5
matplotlib: 3.2.1
joblib: 0.14.1
threadpoolctl: 2.0.0

Built with OpenMP: True

@thomasjpfan
Member

X has values that are just barely under np.finfo(np.float64).max, so it passes through check_array, but the calculations Birch then performs on these values overflow past that maximum.

One way to try to catch this is to intercept the RuntimeWarning and raise a more informative error. I am -0.5 on this, because if we go down this route it would make sense to catch it everywhere in the library. I think the RuntimeWarning is good enough to show what is going wrong.
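A sketch of that idea (the `checked_sum` helper is hypothetical, not scikit-learn API): escalate numpy's overflow RuntimeWarning to an exception and re-raise it as a ValueError with a clearer message:

```python
import warnings
import numpy as np

def checked_sum(values):
    """Sum values, escalating numpy overflow warnings to a ValueError.

    Hypothetical helper illustrating the approach discussed above.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("error", RuntimeWarning)
        try:
            return np.add.reduce(values)
        except RuntimeWarning as exc:
            raise ValueError(
                "Input contains values too large for dtype('float64')"
            ) from exc

# Finite inputs whose sum overflows float64
x = np.array([1.5e308, 1.6e308])
try:
    checked_sum(x)
except ValueError as e:
    print(e)  # the informative message instead of a bare warning
```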

@rth
Member
rth commented Jul 21, 2020

The underlying issue to that error might be similar to #6172

Related discussion on behavior with very large floats in #17925

@thomasjpfan
Member

Running the snippet on main gives:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

@rth
Member
rth commented Feb 11, 2021

Yes, that's why it would be nice to have a minimal reproducible example that doesn't use huge floats, as it does look like there is a code path that triggers the above AttributeError.

@sherbold

Is the better error message a result of #18727? That at least resolves this issue. I will try to come up with a way to trigger the exception for smaller data, to possibly help with #6172.

@glemaitre glemaitre added Bug and removed Bug: triage labels Dec 22, 2021
@glemaitre
Member
glemaitre commented Dec 22, 2021

Labelling as a bug, but we should get a minimal reproducible example to be able to reproduce the error on a Unix machine.

@nlahaye
nlahaye commented Apr 29, 2022

I have run into this issue as well with much smaller numbers, using this data and running this code:

import dask.array as da
from sklearn.cluster import Birch
from sklearn.preprocessing import StandardScaler

data = da.from_zarr("bug_data.zarr")
print("Data Shape: ", data.shape)
print("Min, Max, Mean, StDev.: ", data.min().compute(), data.max().compute(), data.mean().compute(), data.std().compute())
scaler = StandardScaler()
scaler.fit(data)
data = scaler.transform(data)
print("Post-Scale - Min, Max, Mean, StDev.: ", data.min(), data.max(), data.mean(), data.std())
clustering = Birch(branching_factor=5, threshold=1e-5, n_clusters=None)
clustering.fit(data)

I run into this error:

Data Shape: (150000, 2000)
Min, Max, Mean, StDev.: -1.7028557 5.1015463 0.020574544 0.32617828
/home/nlahaye/.local/lib/python3.8/site-packages/dask/array/core.py:1650: FutureWarning: The numpy.may_share_memory function is not implemented by Dask array. You may want to use the da.map_blocks function or something similar to silence this warning. Your code may stop working in a future release.
warnings.warn(
Post-Scale - Min, Max, Mean, StDev.: -5.8093686 7.8372993 4.7429404e-11 1.0000027
Traceback (most recent call last):
  File "clustering_bug.py", line 25, in <module>
    clustering.fit(data)
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 517, in fit
    return self._fit(X, partial=False)
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 562, in _fit
    split = self.root_.insert_cf_subcluster(subcluster)
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 200, in insert_cf_subcluster
    split_child = closest_subcluster.child_.insert_cf_subcluster(subcluster)
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 200, in insert_cf_subcluster
    split_child = closest_subcluster.child_.insert_cf_subcluster(subcluster)
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 200, in insert_cf_subcluster
    split_child = closest_subcluster.child_.insert_cf_subcluster(subcluster)
  [Previous line repeated 3 more times]
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 221, in insert_cf_subcluster
    self.update_split_subclusters(
  File "/home/nlahaye/.local/lib/python3.8/site-packages/sklearn/cluster/_birch.py", line 179, in update_split_subclusters
    self.init_sq_norm_[ind] = new_subcluster1.sq_norm_
AttributeError: '_CFSubcluster' object has no attribute 'sq_norm_'

For simplicity, I extracted this code from the software I use for clustering and stripped away the dask-ml wrappers; I have been able to successfully complete jobs with other datasets using it. This data is also a reduced set from a dataset with many more samples.

Environment:
OS - CentOS-7
python - v3.8.2
dask - v2022.04.1
sklearn - v1.0.2

Please let me know if there is any other info you would like, etc.

Thanks!
Nick

@nlahaye
nlahaye commented May 25, 2022

Thank you both for your help!!
