inconsistent behaviour with small parquet row groups · Issue #16652 · duckdb/duckdb · GitHub
inconsistent behaviour with small parquet row groups #16652
Closed
@DylanZA

Description

What happens?

We have noticed inconsistent behaviour with small row groups and windowing.
Having checked DuckDB 1.0.0, 1.2.0 and 1.2.1, it does not seem to occur in 1.0.0 and is worse in 1.2.1.

It also seems very machine dependent, which makes me suspect threading.

I could not get DuckDB to write parquet with tiny row groups, so I used pandas instead :)

I would not expect any NULLs ever, and I do not believe you should expect anything out of order, but that is not 100% clear from the docs. Clarity here would be great.
You could also use list(value ORDER BY value) to be more explicit, but again it is not clear whether this is necessary. The issue still reproduces if you do this, however.
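
To make the expectation above concrete, here is a minimal pure-Python sketch of what I would expect the window query to return for each row: the full list of values for that row's id, in ascending order. This models my expectation only, not documented DuckDB semantics, and the helper names (`expected_window_lists`, `count_violations`) are hypothetical and used only for illustration.

```python
from collections import defaultdict


def expected_window_lists(rows):
    """For each (id, value) row, return all values of that id, sorted ascending.

    This mirrors the expected result of
    list(value) OVER (PARTITION BY id ORDER BY value ROWS BETWEEN
    UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).
    """
    by_id = defaultdict(list)
    for row_id, value in rows:
        by_id[row_id].append(value)
    sorted_by_id = {row_id: sorted(values) for row_id, values in by_id.items()}
    return [sorted_by_id[row_id] for row_id, _ in rows]


def count_violations(lists):
    """Count NULL (None) lists and lists whose first element is not the minimum."""
    null_rows = sum(1 for v in lists if v is None)
    unordered = sum(1 for v in lists if v is not None and v[0] != min(v))
    return null_rows, unordered


rows = [(1, 0.3), (0, 0.9), (1, 0.1), (0, 0.2)]
result = expected_window_lists(rows)
print(result)
print(count_violations(result))
```

Under this model both counters are always zero, which is the outcome I would expect the reproduction script to print on every iteration.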

To Reproduce

import pandas as pd
import random
import duckdb
print(duckdb.__version__)
# 25 ids with 59 rows each, written with row groups of 60 rows,
# so partitions do not line up with row group boundaries.
pd.DataFrame(
    [[i, random.random()] for i in range(25) for _ in range(59)],
    columns=["id", "value"],
).sort_values("id").to_parquet("/tmp/repro2.parquet", row_group_size=60)

# Repeat the query; the failure is intermittent.
for _ in range(50):
    a = duckdb.sql("""
    SELECT 
        list(value) OVER (PARTITION BY id ORDER BY value ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS values
        FROM read_parquet('/tmp/repro2.parquet')
    """).to_table("X")
    # Rows where the window produced NULL instead of a list.
    null_rows = len(duckdb.sql("select * from X where values is null"))
    # Rows whose list does not start with its minimum, i.e. is not in ascending order.
    unordered = len(duckdb.sql("select * from X where values is not null and values[1] != list_aggregate(values, 'min')"))
    print("null = {} unordered = {}".format(null_rows, unordered))
    duckdb.sql("drop table X;")
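
For reference, the layout the script above should produce can be worked out with plain arithmetic. The sketch below assumes the parquet writer fills every row group to exactly 60 rows (the last one may be shorter) and that rows are stored in id order; neither is verified against the actual file. The function name `row_groups_for_id` is hypothetical.

```python
import math

N_IDS = 25           # distinct ids in the repro
ROWS_PER_ID = 59     # rows per id
ROW_GROUP_SIZE = 60  # row_group_size passed to to_parquet

total_rows = N_IDS * ROWS_PER_ID
num_row_groups = math.ceil(total_rows / ROW_GROUP_SIZE)


def row_groups_for_id(i):
    """Return the indices of the row groups that hold rows of id i."""
    first_row = i * ROWS_PER_ID
    last_row = first_row + ROWS_PER_ID - 1
    first_group = first_row // ROW_GROUP_SIZE
    last_group = last_row // ROW_GROUP_SIZE
    return list(range(first_group, last_group + 1))


# ids whose rows are split across more than one row group
straddling = [i for i in range(N_IDS) if len(row_groups_for_id(i)) > 1]
print(total_rows, num_row_groups, len(straddling))
```

Under these assumptions every id except 0 is split across two adjacent row groups, so almost every window partition has to be assembled from more than one row group.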

OS:

linux

DuckDB Version:

1.2.0/1

DuckDB Client:

python

Hardware:

No response

Full Name:

Dylan Yudaken

Affiliation:

Qubos

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
