8000 Ensure divisions are plain scalars by TomAugspurger · Pull Request #11767 · dask/dask · GitHub

Ensure divisions are plain scalars #11767


Merged · 1 commit merged into dask:main on Feb 21, 2025

Conversation

TomAugspurger (Member) commented on Feb 20, 2025

This ensures that the .divisions of a Dask DataFrame are plain Python scalars, rather than NumPy scalars.
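For context, indexing a NumPy array returns a NumPy scalar type, while `tolist` yields builtin Python scalars. A minimal sketch of the distinction (variable names are illustrative, not from the PR):

```python
import numpy as np

arr = np.arange(4)

# Indexing a NumPy array yields a NumPy scalar, not a builtin int.
numpy_scalar = arr[1]
python_scalar = arr.tolist()[1]

print(type(numpy_scalar).__name__)   # e.g. "int64" (platform-dependent)
print(type(python_scalar).__name__)  # "int"
```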

This does involve a tolist call on a potentially large-ish (but small enough to fit in memory) ndarray / Series. Here's a little pytest-benchmark script to make sure we don't introduce a huge slowdown:

import numpy as np
import pandas as pd
from dask.dataframe.io.io import sorted_division_locations


def run(seq, npartitions):
    sorted_division_locations(seq, npartitions)


def test_array_small_few(benchmark):
    benchmark(run, np.arange(12), 3)


def test_array_large_few(benchmark):
    benchmark(run, np.arange(1_000_000), 3)


def test_array_large_many(benchmark):
    benchmark(run, np.arange(1_000_000), 1_000)


def test_datetime_small_few(benchmark):
    benchmark(run, pd.date_range(start="2000-01-01", periods=3, freq="s"), 3)


# NOTE: the "large" datetime cases below still use periods=3 (apparently a
# copy/paste slip); the small absolute times in the table below reflect that.
def test_datetime_large_few(benchmark):
    benchmark(run, pd.date_range(start="2000-01-01", periods=3, freq="s"), 3)


def test_datetime_large_many(benchmark):
    benchmark(run, pd.date_range(start="2000-01-01", periods=3, freq="s"), 1_000)

I ran that on main and on this branch with --benchmark-compare; here's the output. The way to read this is to compare the NOW row (this PR) against the 0001_main row (main) for each benchmark.

------------------------------------------------------------------------------------------------------- benchmark: 12 tests -------------------------------------------------------------------------------------------------------
Name (time in us)                                Min                    Max                   Mean              StdDev                 Median                   IQR            Outliers           OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_array_small_few (NOW)                    2.2500 (1.0)          46.8330 (4.39)          2.7033 (1.11)       0.9540 (5.58)          2.6670 (1.10)         0.0840 (1.01)      55;1437  369,921.4971 (0.90)      10350           1
test_array_small_few (0001_main)              2.2910 (1.02)         10.6660 (1.0)           2.4256 (1.0)        0.1709 (1.0)           2.4170 (1.0)          0.0830 (1.0)       161;204  412,275.9181 (1.0)       11000           1
test_datetime_large_many (NOW)                3.0000 (1.33)         92.2910 (8.65)          3.7458 (1.54)       1.4999 (8.78)          3.6250 (1.50)         0.2080 (2.51)     283;2404  266,964.7630 (0.65)      31538           1
test_datetime_large_few (NOW)                 3.0410 (1.35)         76.2920 (7.15)          3.6596 (1.51)       1.1164 (6.53)          3.6250 (1.50)         0.1250 (1.51)     233;3199  273,252.3657 (0.66)      25862           1
test_datetime_small_few (NOW)                 3.1250 (1.39)        130.9160 (12.27)         3.8374 (1.58)       1.9554 (11.44)         3.7500 (1.55)         0.0840 (1.01)      57;1189  260,590.4077 (0.63)       8203           1
test_datetime_large_few (0001_main)           4.8330 (2.15)        111.8750 (10.49)         5.7216 (2.36)       1.1700 (6.85)          5.6660 (2.34)         0.1670 (2.01)     469;2777  174,775.5574 (0.42)      24516           1
test_datetime_large_many (0001_main)          4.8750 (2.17)        155.2500 (14.56)         5.7569 (2.37)       1.2499 (7.31)          5.7080 (2.36)         0.1250 (1.51)     453;2856  173,703.6967 (0.42)      30151           1
test_datetime_small_few (0001_main)           4.9170 (2.19)        570.2500 (53.46)         6.3998 (2.64)       8.6409 (50.56)         5.7920 (2.40)         0.2510 (3.02)      115;932  156,254.8019 (0.38)      11517           1
test_array_large_few (0001_main)         20,779.0000 (>1000.0)  23,364.1670 (>1000.0)  21,978.8497 (>1000.0)  523.8696 (>1000.0)  21,950.7710 (>1000.0)    710.2920 (>1000.0)      11;1       45.4983 (0.00)         46           1
test_array_large_many (0001_main)        21,226.6660 (>1000.0)  23,601.9170 (>1000.0)  22,497.5148 (>1000.0)  590.0236 (>1000.0)  22,529.9580 (>1000.0)  1,079.9585 (>1000.0)      18;0       44.4494 (0.00)         45           1
test_array_large_few (NOW)               33,575.6250 (>1000.0)  37,392.3750 (>1000.0)  34,794.8907 (>1000.0)  849.8158 (>1000.0)  34,718.5840 (>1000.0)  1,013.3855 (>1000.0)       7;1       28.7399 (0.00)         29           1
test_array_large_many (NOW)              33,691.2080 (>1000.0)  37,366.6670 (>1000.0)  35,164.6895 (>1000.0)  948.5968 (>1000.0)  34,955.2920 (>1000.0)  1,392.1565 (>1000.0)       7;0       28.4376 (0.00)         31           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

There is a small performance hit for the array_large workloads (roughly 23 µs -> 35 µs). I think the datetime_large speedups might come from scalar __getitem__ on a pandas Index being relatively slow compared to indexing a Python list, so repeated getitems add up, but I haven't verified that. Either way, these numbers are small enough in absolute magnitude not to matter.
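A quick way to sanity-check that hunch is a hypothetical micro-benchmark (sizes and names here are made up, not from the PR): scalar __getitem__ on a DatetimeIndex constructs a Timestamp on each access, while list indexing just returns an existing object.

```python
import timeit

import pandas as pd

idx = pd.date_range("2000-01-01", periods=100_000, freq="s")
lst = idx.tolist()  # a list of pd.Timestamp objects

# DatetimeIndex.__getitem__ builds a Timestamp from the underlying i8
# value each call; list.__getitem__ returns a pre-built object.
t_index = timeit.timeit(lambda: idx[50_000], number=20_000)
t_list = timeit.timeit(lambda: lst[50_000], number=20_000)
print(f"Index getitem: {t_index:.4f}s  list getitem: {t_list:.4f}s")
```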

Note: yesterday while looking at this I thought I hit another failure, related to indexing. But I'm not able to find / reproduce that today. DataFrame.divisions can be created in several ways depending on the code path being exercised, so I wouldn't be surprised if I missed some.

Closes #11765

TomAugspurger (Member, Author) commented:

Note: yesterday while looking at this I thought I hit another failure, related to indexing. But I'm not able to find / reproduce that today.

Maybe @fjetter preemptively fixed this for me in #11764. If so, thanks :)


Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

9 files ±0 · 9 suites ±0 · 3h 31m 56s ⏱️ +2m 19s
17 774 tests +2 · 16 561 ✅ +2 · 1 213 💤 ±0 · 0 ❌ ±0
159 040 runs +18 · 146 893 ✅ +18 · 12 147 💤 ±0 · 0 ❌ ±0

Results for commit 9da3603. ± Comparison against base commit 1110c97.

@phofl phofl merged commit f2b8287 into dask:main Feb 21, 2025
28 checks passed
phofl (Collaborator) commented Feb 21, 2025

thanks!

@TomAugspurger TomAugspurger deleted the tom/fix/plain-divisions branch February 21, 2025 14:29
Linked issue closed by this pull request: What scalar type is expected for DataFrame.divisions? (#11765)