Counterintuitive AttributeError in Birch for very large numbers #17966
Comments
One way to handle this is to catch the runtime warning and raise a more informative error. I am -0.5 on this because, if we go down this route, it would make sense to do the same everywhere in the library. I think the
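For reference, the "catch the runtime warning" idea could look roughly like the sketch below. This is only an illustration, not code from scikit-learn: `squared_norms_checked` is a hypothetical helper, and it assumes the overflow surfaces as a NumPy `RuntimeWarning` during a squared-sum computation.

```python
import warnings

import numpy as np


def squared_norms_checked(X):
    """Hypothetical helper: row-wise squared norms, with overflow reported as ValueError."""
    X = np.asarray(X, dtype=np.float64)
    # Ask NumPy to warn on overflow, then promote that warning to an exception
    # inside this block only, so the global warning filters stay untouched.
    with np.errstate(over="warn"), warnings.catch_warnings():
        warnings.simplefilter("error", RuntimeWarning)
        try:
            return (X * X).sum(axis=1)
        except RuntimeWarning as exc:
            raise ValueError(
                "Input contains values too large to be squared in float64."
            ) from exc


print(squared_norms_checked([[3.0, 4.0]]))  # small values pass through unchanged

try:
    squared_norms_checked([[1e200, 1e200]])
except ValueError as err:
    print("ValueError:", err)
```

The downside raised in this comment still applies: a wrapper like this would be needed around every computation that can overflow, not just in one estimator.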
Running the snippet on main gives:
Yes, that's why it would be nice to have a minimal reproducible example that doesn't use huge floats, as it does look like there is a code path that triggers the above AttributeError.
Labelling as a bug, but we should get a minimal reproducible example to be able to reproduce the error on a Unix machine.
I have run into this issue as well with much smaller numbers. Using this data, and running this code:
I run into this error:
Data Shape: (150000, 2000)
For simplicity, I extracted this code and stripped away the dask-ml wrappers from the software I use for clustering. I have been able to successfully complete jobs with other datasets. This data is also a reduced set from a dataset that has many more samples.
Environment:
Please let me know if there is any other info you would like. Thanks!
Thank you both for your help!!
Describe the bug
Input data containing very large numbers causes overflows in the Birch algorithm, which manifest as different errors depending on the branching factor parameter. If the number of data points is smaller than or equal to the branching factor, a ValueError is raised in AgglomerativeClustering, but if the number of data points exceeds the branching factor, an AttributeError is raised instead. Since both errors are caused by the input data, I would expect to get a ValueError in both cases.
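To illustrate the kind of overflow involved, here is a minimal NumPy sketch. It does not call Birch at all. It assumes that the algorithm squares the input values at some point, and the array `X` is made-up data chosen so that every entry is finite in float64 while its square is not.

```python
import numpy as np

# Made-up data: 1e200 is representable in float64, but its square (1e400) is not.
X = np.array([[1e200, 2e200], [3e200, 4e200]], dtype=np.float64)

# Silence the overflow/invalid warnings here so the results can be inspected directly.
with np.errstate(over="ignore", invalid="ignore"):
    squared_norms = (X * X).sum(axis=1)  # row-wise sum of squares overflows to inf
    difference = squared_norms[0] - squared_norms[1]  # inf - inf is NaN

print(np.isfinite(X).all())  # the input itself passes a finiteness check
print(squared_norms)
print(difference)
```

Because the input is finite, a plain finiteness check on `X` does not catch the problem. The non-finite values only appear in intermediate results, which would explain why the failure shows up later and in different places.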
Steps/Code to Reproduce
Running the same code with fewer data points causes a ValueError; with more data points it causes an AttributeError.
Example:
Expected Results
A ValueError that specifies the range of allowed values, as in other clustering algorithms:
Or an error similar to the ValueError raised when the number of data points is smaller than or equal to the branching factor:
Actual Results
Versions
System:
python: 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\thaar\PycharmProjects\sklearn-dev\venv\Scripts\python.exe
machine: Windows-10-10.0.18362-SP0
Python dependencies:
pip: 20.1.1
setuptools: 49.2.0
sklearn: 0.23.1
numpy: 1.18.4
scipy: 1.4.1
Cython: None
pandas: 1.0.5
matplotlib: 3.2.1
joblib: 0.14.1
threadpoolctl: 2.0.0
Built with OpenMP: True