8000 Ensure divisions are plain scalars by TomAugspurger · Pull Request #11767 · dask/dask · GitHub

Ensure divisions are plain scalars #11767


Merged · 1 commit merged into dask:main on Feb 21, 2025

Conversation

TomAugspurger (Member) commented on Feb 20, 2025

This ensures that the .divisions of a Dask DataFrame are plain Python scalars, rather than NumPy scalars.
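For context, indexing a NumPy array returns a NumPy scalar type, while `tolist` yields builtin Python scalars. A minimal sketch of the distinction (variable names are illustrative, not from the PR):

```python
import numpy as np

arr = np.arange(4)

# Indexing a NumPy array yields a NumPy scalar, not a builtin int.
numpy_scalar = arr[1]
python_scalar = arr.tolist()[1]

print(type(numpy_scalar).__name__)   # e.g. "int64" (platform-dependent)
print(type(python_scalar).__name__)  # "int"
```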

This does involve a tolist call on a potentially large-ish (but small enough to fit in memory) ndarray / Series. Here's a little pytest-benchmark script to make sure we don't introduce a huge slowdown:

import numpy as np
import pandas as pd
from dask.dataframe.io.io import sorted_division_locations


def run(seq, npartitions):
    sorted_division_locations(seq, npartitions)


def test_array_small_few(benchmark):
    benchmark(run, np.arange(12), 3)


def test_array_large_few(benchmark):
    benchmark(run, np.arange(1_000_000), 3)


def test_array_large_many(benchmark):
    benchmark(run, np.arange(1_000_000), 1_000)


def test_datetime_small_few(benchmark):
    benchmark(run, pd.date_range(start="2000-01-01", periods=3, freq="s"), 3)


# NOTE: the "large" datetime cases below still use periods=3 (apparently a
# copy/paste slip); the small absolute times in the table below reflect that.
def test_datetime_large_few(benchmark):
    benchmark(run, pd.date_range(start="2000-01-01", periods=3, freq="s"), 3)


def test_datetime_large_many(benchmark):
    benchmark(run, pd.date_range(start="2000-01-01", periods=3, freq="s"), 1_000)

I ran that on main and on this branch with --benchmark-compare; here's the output. The way to read this is to compare the NOW row (this PR) against the 0001_main row (main) for each benchmark.

------------------------------------------------------------------------------------------------------- benchmark: 12 tests -------------------------------------------------------------------------------------------------------
Name (time in us)                                Min                    Max                   Mean              StdDev                 Median                   IQR            Outliers           OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_array_small_few (NOW)                    2.2500 (1.0)          46.8330 (4.39)          2.7033 (1.11)       0.9540 (5.58)          2.6670 (1.10)         0.0840 (1.01)      55;1437  369,921.4971 (0.90)      10350           1
test_array_small_few (0001_main)              2.2910 (1.02)         10.6660 (1.0)           2.4256 (1.0)        0.1709 (1.0)           2.4170 (1.0)          0.0830 (1.0)       161;204  412,275.9181 (1.0)       11000           1
test_datetime_large_many (NOW)                3.0000 (1.33)         92.2910 (8.65)          3.7458 (1.54)       1.4999 (8.78)          3.6250 (1.50)         0.2080 (2.51)     283;2404  266,964.7630 (0.65)      31538           1
test_datetime_large_few (NOW)                 3.0410 (1.35)         76.2920 (7.15)          3.6596 (1.51)       1.1164 (6.53)          3.6250 (1.50)         0.1250 (1.51)     233;3199  273,252.3657 (0.66)      25862           1
test_datetime_small_few (NOW)                 3.1250 (1.39)        130.9160 (12.27)         3.8374 (1.58)       1.9554 (11.44)         3.7500 (1.55)         0.0840 (1.01)      57;1189  260,590.4077 (0.63)       8203           1
test_datetime_large_few (0001_main)           4.8330 (2.15)        111.8750 (10.49)         5.7216 (2.36)       1.1700 (6.85)          5.6660 (2.34)         0.1670 (2.01)     469;2777  174,775.5574 (0.42)      24516           1
test_datetime_large_many (0001_main)          4.8750 (2.17)        155.2500 (14.56)         5.7569 (2.37)       1.2499 (7.31)          5.7080 (2.36)         0.1250 (1.51)     453;2856  173,703.6967 (0.42)      30151           1
test_datetime_small_few (0001_main)           4.9170 (2.19)        570.2500 (53.46)         6.3998 (2.64)       8.6409 (50.56)         5.7920 (2.40)         0.2510 (3.02)      115;932  156,254.8019 (0.38)      11517           1
test_array_large_few (0001_main)         20,779.0000 (>1000.0)  23,364.1670 (>1000.0)  21,978.8497 (>1000.0)  523.8696 (>1000.0)  21,950.7710 (>1000.0)    710.2920 (>1000.0)      11;1       45.4983 (0.00)         46           1
test_array_large_many (0001_main)        21,226.6660 (>1000.0)  23,601.9170 (>1000.0)  22,497.5148 (>1000.0)  590.0236 (>1000.0)  22,529.9580 (>1000.0)  1,079.9585 (>1000.0)      18;0       44.4494 (0.00)         45           1
test_array_large_few (NOW)               33,575.6250 (>1000.0)  37,392.3750 (>1000.0)  34,794.8907 (>1000.0)  849.8158 (>1000.0)  34,718.5840 (>1000.0)  1,013.3855 (>1000.0)       7;1       28.7399 (0.00)         29           1
test_array_large_many (NOW)              33,691.2080 (>1000.0)  37,366.6670 (>1000.0)  35,164.6895 (>1000.0)  948.5968 (>1000.0)  34,955.2920 (>1000.0)  1,392.1565 (>1000.0)       7;0       28.4376 (0.00)         31           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

There is a small performance hit for the array_large workloads (roughly 23 µs -> 35 µs). I think the datetime_large speedups might come from scalar __getitem__ on a pandas Index being relatively slow compared to indexing a Python list, so repeated getitems add up, but I haven't verified that. Either way, these numbers are small enough in absolute magnitude not to matter.
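A quick way to sanity-check that hunch is a hypothetical micro-benchmark (sizes and names here are made up, not from the PR): scalar __getitem__ on a DatetimeIndex constructs a Timestamp on each access, while list indexing just returns an existing object.

```python
import timeit

import pandas as pd

idx = pd.date_range("2000-01-01", periods=100_000, freq="s")
lst = idx.tolist()  # a list of pd.Timestamp objects

# DatetimeIndex.__getitem__ builds a Timestamp from the underlying i8
# value each call; list.__getitem__ returns a pre-built object.
t_index = timeit.timeit(lambda: idx[50_000], number=20_000)
t_list = timeit.timeit(lambda: lst[50_000], number=20_000)
print(f"Index getitem: {t_index:.4f}s  list getitem: {t_list:.4f}s")
```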

Note: yesterday while looking at this I thought I hit another failure, related to indexing. But I'm not able to find / reproduce that today. DataFrame.divisions can be created in several ways depending on the code path being exercised, so I wouldn't be surprised if I missed some.

Closes #11765

TomAugspurger (Member, Author) commented:

Note: yesterday while looking at this I thought I hit another failure, related to indexing. But I'm not able to find / reproduce that today.

Maybe @fjetter preemptively fixed this for me in #11764. If so, thanks :)


Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

9 files ±0 · 9 suites ±0 · 3h 31m 56s ⏱️ +2m 19s
17 774 tests +2 · 16 561 ✅ +2 · 1 213 💤 ±0 · 0 ❌ ±0
159 040 runs +18 · 146 893 ✅ +18 · 12 147 💤 ±0 · 0 ❌ ±0

Results for commit 9da3603. ± Comparison against base commit 1110c97.

@phofl phofl merged commit f2b8287 into dask:main Feb 21, 2025
28 checks passed
phofl (Collaborator) commented Feb 21, 2025

thanks!

@TomAugspurger TomAugspurger deleted the tom/fix/plain-divisions branch February 21, 2025 14:29
Linked issue closed by this pull request: What scalar type is expected for DataFrame.divisions? (#11765)