Description
What happens?
We have noticed inconsistent behaviour when using window functions over a Parquet file with small row groups.
Having checked DuckDB 1.0.0, 1.2.0 and 1.2.1, the problem does not seem to exist in 1.0.0 and is worse in 1.2.1.
It also seems very machine-dependent, which makes me suspect threading.
I could not get DuckDB to write Parquet with tiny row groups, so I used pandas instead :)
I would not expect any NULLs ever, and I do not believe anything should come out of order, but that is not 100% clear from the docs. Clarity here would be great.
You could also write list(value ORDER BY value) to be more explicit, but again it is not clear whether this is necessary. The problem still reproduces if you do this, however.
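To make the expectation concrete, here is a minimal pure-Python sketch of the result I would expect from the window expression. It assumes that the window's ORDER BY determines the element order inside each list, which is exactly the point the docs leave unclear. The function name `expected_window_lists` is hypothetical and only used for illustration.

```python
from collections import defaultdict


def expected_window_lists(rows):
    """Model of the expected output for
    list(value) OVER (PARTITION BY id ORDER BY value
                      ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).

    Assumption: the window ORDER BY fixes the element order of each list.
    rows is an iterable of (id, value) pairs; one list is returned per row.
    """
    rows = list(rows)

    # Collect every value belonging to each partition (id).
    by_id = defaultdict(list)
    for row_id, value in rows:
        by_id[row_id].append(value)

    # With an unbounded frame, every row of a partition sees the whole
    # partition, so each row gets the same fully sorted list.
    ordered = {row_id: sorted(values) for row_id, values in by_id.items()}

    # Returned in input order for convenience only; SQL gives no such guarantee.
    return [ordered[row_id] for row_id, _ in rows]


sample = [(0, 0.3), (0, 0.1), (1, 0.7), (0, 0.2)]
result = expected_window_lists(sample)
```

Under this reading, a row can never see NULL (every partition has at least one row) and the first element of every list is the partition minimum.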
To Reproduce
import pandas as pd
import random
import duckdb

print(duckdb.__version__)

# 59 rows per id with row groups of 60 rows, so row groups do not line up with the id partitions
pd.DataFrame(
    [[i, random.random()] for i in range(25) for _ in range(59)],
    columns=["id", "value"],
).sort_values("id").to_parquet("/tmp/repro2.parquet", row_group_size=60)

# Repeat the query, since the failure is intermittent
for _ in range(50):
    a = duckdb.sql("""
        SELECT
            list(value) OVER (PARTITION BY id ORDER BY value ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS values
        FROM read_parquet('/tmp/repro2.parquet')
    """).to_table("X")
    # Rows where the list is NULL, and rows where the first element is not the minimum
    null_rows = len(duckdb.sql("select * from X where values is null"))
    unordered = len(duckdb.sql("select * from X where values is not null and values[1] != list_aggregate(values, 'min')"))
    print("null = {} unordered = {}".format(null_rows, unordered))
    duckdb.sql("drop table X;")
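The SQL check above only compares the first element of each list with the minimum, so a list that starts correctly but is shuffled later would not be counted. As a stricter cross-check, the sketch below counts NULL lists and lists that are not fully ascending in plain Python. The helper name `classify_window_lists` is hypothetical, and it assumes the lists have already been fetched into Python objects.

```python
def classify_window_lists(lists):
    """Count NULL results and results that are not fully sorted ascending.

    lists is an iterable whose items are either None (a NULL result)
    or a Python list of numbers.
    """
    null_rows = 0
    unsorted_rows = 0
    for values in lists:
        if values is None:
            null_rows += 1
        # Compare each element with its successor; any descent means unsorted.
        elif any(a > b for a, b in zip(values, values[1:])):
            unsorted_rows += 1
    return null_rows, unsorted_rows


# [1, 3, 2] starts with its minimum, so a first-element check alone misses it.
counts = classify_window_lists([[1, 2, 3], None, [2, 1, 3], [1, 3, 2]])
```

Inside the loop, the input could be built with something like `[row[0] for row in duckdb.sql("select values from X").fetchall()]` before the table is dropped.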
OS:
linux
DuckDB Version:
1.2.0 and 1.2.1
DuckDB Client:
python
Hardware:
No response
Full Name:
Dylan Yudaken
Affiliation:
Qubos
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- Yes, I have