Describe the issue:
After setting the index to the timestamp column, some loc-based queries work, but a query that compares the index against a date string causes compute() to hang unless optimize() is called first.
Minimal Complete Verifiable Example:
import dask.dataframe as dd
import pandas as pd
import random


def test_df() -> dd.DataFrame:
    dfs = []
    start_date = '2024-01-01'
    end_date = '2024-01-31'
    for num_rows in [2, 5, 10]:
        df = pd.DataFrame(
            {
                'timestamp': pd.to_datetime(
                    pd.date_range(start_date, end_date, periods=num_rows),
                ),
                'value1': random.choices(range(-20, 20), k=num_rows),
                'value2': random.choices(range(-1000, 1000), k=num_rows),
            },
        )
        dfs.append(
            dd.from_pandas(df, npartitions=1),
        )
    return dd.concat(dfs)


df = test_df()
df = df.set_index('timestamp', npartitions=df.npartitions)
# Uncommenting the next line avoids the hang:
# df = df.optimize()
df.loc[df.index > '2024-01-15'].compute()  # hangs here
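For comparison, here is a pandas-only sketch of the same filter, showing the row count the Dask query would be expected to return. It is a baseline under assumptions, not part of the original reproduction: the helper name build_pandas_frame is hypothetical, and the random value columns are replaced by a fixed placeholder column since they do not affect the filter.

```python
import pandas as pd


def build_pandas_frame() -> pd.DataFrame:
    """Hypothetical helper: same timestamps as the Dask example, in plain pandas."""
    frames = []
    for num_rows in [2, 5, 10]:
        timestamps = pd.date_range('2024-01-01', '2024-01-31', periods=num_rows)
        # Fixed placeholder values; the original example uses random ones.
        frames.append(
            pd.DataFrame({'timestamp': timestamps, 'value1': list(range(num_rows))})
        )
    return pd.concat(frames).set_index('timestamp').sort_index()


pdf = build_pandas_frame()
# pandas parses the date string into a Timestamp before comparing.
expected = pdf.loc[pdf.index > '2024-01-15']
print(len(pdf), len(expected))
```

The length of expected is the number of rows the Dask version should produce once compute() returns.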
Anything else we need to know?:
Environment:
- Dask version: 2025.2.0
- Python version: 3.10
- Operating System: macOS and Linux (both tested)
- Install method (conda, pip, source): pip