8000 Investigate indexing common query fields (e.g. type, location ref/start/end, and ids) · Issue #140 · biocommons/anyvar · GitHub
Open
@theferrit32

Description


Feature description

Right now we dump all objects into a key/value table called vrs_objects, which holds the object id and a JSON blob. So far this has performed well in PostgreSQL and DuckDB, at least at small scales of hundreds of thousands of rows, but it is unlikely to scale well into the millions or hundreds of millions. With the current table schemas and query strings, many queries we do (and foresee needing to do) will require full table scans, possibly multiple full table scans per query.

PostgreSQL supports indexing fields in JSONB columns directly:
https://www.postgresql.org/docs/current/datatype-json.html#JSON-INDEXING
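As a sketch of what this could look like, assuming the current vrs_objects(vrs_id, vrs_object) schema with vrs_object stored as JSONB (index names here are illustrative):

```sql
-- GIN index over the whole JSONB document; supports containment
-- queries such as: WHERE vrs_object @> '{"type": "Allele"}'
CREATE INDEX idx_vrs_objects_gin ON vrs_objects USING GIN (vrs_object);

-- Or a targeted B-tree expression index on a single field, used by
-- queries of the form: WHERE vrs_object ->> 'type' = 'Allele'
CREATE INDEX idx_vrs_objects_type ON vrs_objects ((vrs_object ->> 'type'));
```

The GIN index covers many fields at once but is larger and only helps containment/existence operators; per-field expression indexes are smaller and work with ordinary equality queries.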

DuckDB does not. Instead, we can create an additional column for each JSON field we want to index. I'm not sure how much this would bloat the database storage size: it means storing a second copy of every indexed value, and creating null-valued columns for rows whose JSON does not contain the indexed field.

e.g. Allele objects have a location field (a string id), which we do want to index, but the Location objects those ids point to do not have a location field.

Another downside is that this indexing is not transparent: when we query on an indexed field, we cannot refer to the field inside the JSON column; we have to refer specifically to the manufactured column holding the extracted value. This can be fine, but if we start adding a lot of indexes it gets harder to keep track of.
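Concretely, with a manufactured column the query has to name that column rather than the JSON path (DuckDB syntax; obj_type is an illustrative column name):

```sql
-- Without the extra column: uniform JSON-path queries, but a full scan.
SELECT * FROM vrs_objects
WHERE json_extract_string(vrs_object, '$.type') = 'Allele';

-- With a manufactured obj_type column: can use an index, but every
-- caller has to know to use obj_type instead of the JSON path.
SELECT * FROM vrs_objects WHERE obj_type = 'Allele';
```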

DuckDB also has generated columns, which at first glance seem like they'd help a lot with maintaining the index columns:
https://duckdb.org/docs/stable/sql/statements/create_table.html#generated-columns
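A generated column stays in sync with the blob automatically, so only the table definition needs to know the extraction logic. A sketch, assuming DuckDB's JSON extension and illustrative names (worth verifying that DuckDB's ART index can be built over a generated column):

```sql
CREATE TABLE vrs_objects (
    vrs_id VARCHAR PRIMARY KEY,
    vrs_object JSON,
    -- VIRTUAL generated column: computed from the blob, kept in sync
    -- automatically on insert/update
    obj_type VARCHAR GENERATED ALWAYS AS
        (json_extract_string(vrs_object, '$.type')) VIRTUAL
);

CREATE INDEX idx_vrs_objects_type ON vrs_objects (obj_type);
```

This avoids the consistency burden of manually maintained columns, though queries still have to name obj_type rather than the JSON path, so the transparency problem above remains.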

Another route we can take is splitting types into their own tables and extracting important values into columns in those type-specific tables. Depending on how many fields we pull up into columns, this makes the anyvar source code and table schemas more tightly coupled to a specific VRS schema.
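An illustrative sketch of what type-specific tables might look like, with a few fields pulled up from the VRS Allele and SequenceLocation shapes (all table and column names are hypothetical):

```sql
CREATE TABLE alleles (
    vrs_id     VARCHAR PRIMARY KEY,
    location   VARCHAR,   -- id of the Location object this allele points to
    vrs_object JSON       -- full blob retained for round-tripping
);
CREATE INDEX idx_alleles_location ON alleles (location);

CREATE TABLE sequence_locations (
    vrs_id             VARCHAR PRIMARY KEY,
    sequence_reference VARCHAR,
    start_pos          BIGINT,
    end_pos            BIGINT,
    vrs_object JSON
);
CREATE INDEX idx_locations_range
    ON sequence_locations (sequence_reference, start_pos, end_pos);
```

The upside is that each table only carries columns its type actually has (no null columns for absent fields); the downside, as noted, is tighter coupling to one VRS schema version.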

Use case

lots of variants

Acceptance Criteria

performs well with lots of variants

Proposed solution

No response

Alternatives considered

No response

Implementation details

No response

Potential Impact

No response

Additional context

No response

Contribution

None

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request)
