8000 Investigate indexing common query fields (e.g. type, location ref/start/end, and ids) · Issue #140 · biocommons/anyvar · GitHub

Investigate indexing common query fields (e.g. type, location ref/start/end, and ids) #140


Open
theferrit32 opened this issue Apr 23, 2025 · 3 comments
Labels
enhancement New feature or request

Comments

@theferrit32
Collaborator

Feature description

Right now we dump all objects into a key/value table called vrs_objects, which has the object id and a JSON blob. So far this has performed well in PostgreSQL and DuckDB, at least at small scales of hundreds of thousands of rows, but it is unlikely to scale well into the millions or hundreds of millions. With the current table schemas and query strings, many queries we do (and foresee needing to do) will require full table scans, possibly multiple full table scans per query.

PostgreSQL supports indexing fields in JSONB columns directly:
https://www.postgresql.org/docs/current/datatype-json.html#JSON-INDEXING
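As a rough illustration of what an expression index over a JSON field buys us: the sketch below uses Python's stdlib sqlite3 (which also supports expression indexes over JSON) as a stand-in, since the same idea applies in PostgreSQL with `CREATE INDEX ... ON vrs_objects ((vrs_object ->> 'type'))`. The table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vrs_objects (vrs_id TEXT PRIMARY KEY, vrs_object TEXT)")
conn.execute(
    "INSERT INTO vrs_objects VALUES (?, ?)",
    ("ga4gh:SL.kpZezgkbDjQ9EJeNY8bpUeIHCfjN7AIq",
     '{"type": "SequenceLocation", "start": 168104561, "end": 168104562}'),
)

# Expression index over a field inside the JSON blob; queries that filter on
# the exact same expression can use it instead of scanning the whole table.
conn.execute(
    "CREATE INDEX idx_vrs_type ON vrs_objects (json_extract(vrs_object, '$.type'))"
)

# EXPLAIN QUERY PLAN shows whether the planner picks the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT vrs_id FROM vrs_objects "
    "WHERE json_extract(vrs_object, '$.type') = 'SequenceLocation'"
).fetchall()
print(plan)
```

The key caveat (which applies in Postgres too) is that the filter in the query has to match the indexed expression; an equivalent-but-differently-written expression falls back to a scan.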

DuckDB does not. Instead, we can create an additional column for each value from the JSON blob that we want to index. I'm not sure how much this would bloat the database storage size: it would double the byte size of every indexed value, and it would require null columns for rows that don't contain the JSON field being indexed.

e.g. Allele objects have a location: text field, which we do want to index, but the Location objects they point to do not have a location field.

Another downside is that this is not transparent indexing: when we want to query on an indexed field, we cannot refer to the field inside the JSON column; we have to refer specifically to our manufactured column that holds that field's value. This can be fine, but if we start adding a ton of indexes it gets harder to keep track of.

DuckDB also has generated columns, which at first glance seem like they'd help a lot with maintaining the index columns:
https://duckdb.org/docs/stable/sql/statements/create_table.html#generated-columns
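A sketch of the generated-column approach, again using sqlite3 as a testable stand-in (SQLite's `GENERATED ALWAYS AS ... VIRTUAL` syntax is similar in spirit to what the DuckDB docs describe, but the exact DuckDB syntax and storage behavior should be checked against the link above). Column names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# loc_start is derived from the JSON blob automatically on every insert;
# VIRTUAL means it is computed on read rather than stored per row, though
# building an index on it materializes the values in the index anyway.
conn.execute("""
    CREATE TABLE vrs_objects (
        vrs_id     TEXT PRIMARY KEY,
        vrs_object TEXT,
        loc_start  INTEGER GENERATED ALWAYS AS
                   (json_extract(vrs_object, '$.start')) VIRTUAL
    )
""")
conn.execute("CREATE INDEX idx_loc_start ON vrs_objects (loc_start)")

# Writers only ever touch vrs_id/vrs_object; the derived column tracks
# the blob for free, so there is no separate column to keep in sync.
conn.execute(
    "INSERT INTO vrs_objects (vrs_id, vrs_object) VALUES (?, ?)",
    ("ga4gh:SL.kpZezgkbDjQ9EJeNY8bpUeIHCfjN7AIq",
     '{"type": "SequenceLocation", "start": 168104561, "end": 168104562}'),
)
row = conn.execute(
    "SELECT vrs_id FROM vrs_objects WHERE loc_start = 168104561"
).fetchone()
print(row)
```

This keeps the manufactured columns from drifting out of sync with the blob, though queries still have to name the generated column rather than the JSON path.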

Another route we could take is splitting types into their own tables and extracting important values into columns in those type-specific tables. Depending on how many fields we pull up into columns, this makes the anyvar source code and table schemas more tightly coupled to a specific VRS schema.
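For comparison, a minimal sketch of the type-specific-table route, again via sqlite3. The table name, column names, and composite index are all hypothetical; the point is that the hot fields become real typed columns while the full object is kept alongside them.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table for SequenceLocation objects only: the fields we query
# on become typed, indexable columns; the blob is retained for round-tripping.
conn.execute("""
    CREATE TABLE sequence_locations (
        vrs_id           TEXT PRIMARY KEY,
        refget_accession TEXT,
        start            INTEGER,
        "end"            INTEGER,
        vrs_object       TEXT
    )
""")
conn.execute(
    'CREATE INDEX idx_loc_range ON sequence_locations (refget_accession, start, "end")'
)

obj = {
    "id": "ga4gh:SL.kpZezgkbDjQ9EJeNY8bpUeIHCfjN7AIq",
    "type": "SequenceLocation",
    "sequenceReference": {"type": "SequenceReference",
                          "refgetAccession": "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO"},
    "start": 168104561,
    "end": 168104562,
}
conn.execute(
    "INSERT INTO sequence_locations VALUES (?, ?, ?, ?, ?)",
    (obj["id"], obj["sequenceReference"]["refgetAccession"],
     obj["start"], obj["end"], json.dumps(obj)),
)

# A range-overlap query, which would be a full scan against the key/value table.
hits = conn.execute(
    "SELECT vrs_id FROM sequence_locations "
    'WHERE refget_accession = ? AND start < ? AND "end" > ?',
    ("SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO", 168104562, 168104561),
).fetchall()
print(hits)
```

The coupling cost is visible here: the INSERT has to know the VRS object layout, so any VRS schema change ripples into the table schema and the extraction code.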

Use case

lots of variants

Acceptance Criteria

performs well with lots of variants

Proposed solution

No response

Alternatives considered

No response

Implementation details

No response

Potential Impact

No response

Additional context

No response

Contribution

None

@theferrit32 theferrit32 added the enhancement New feature or request label Apr 23, 2025
@jsstevenson
Contributor

We can create an additional column for each value from the JSON blob that we want to index. I'm not sure how much this will bloat the database storage size

Is it an issue that the JSON structure is going to differ between variation types (and between SequenceLocation and RLE-based alleles)?

@theferrit32
Collaborator Author

I was thinking more about what happens if we take this record:

{
    "vrs_id": "ga4gh:SL.kpZezgkbDjQ9EJeNY8bpUeIHCfjN7AIq",
    "vrs_object": {
        "id": "ga4gh:SL.kpZezgkbDjQ9EJeNY8bpUeIHCfjN7AIq",
        "type": "SequenceLocation",
        "digest": "kpZezgkbDjQ9EJeNY8bpUeIHCfjN7AIq",
        "sequenceReference": {
            "type": "SequenceReference",
            "refgetAccession": "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO"
        },
        "start": 168104561,
        "end": 168104562
    }
}

And pull up some values into columns for the purpose of indexing them:

{
    "vrs_id": "ga4gh:SL.kpZezgkbDjQ9EJeNY8bpUeIHCfjN7AIq",
    "vrs_object": {
        "id": "ga4gh:SL.kpZezgkbDjQ9EJeNY8bpUeIHCfjN7AIq",
        "type": "SequenceLocation",
        "digest": "kpZezgkbDjQ9EJeNY8bpUeIHCfjN7AIq",
        "sequenceReference": {
            "type": "SequenceReference",
            "refgetAccession": "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO"
        },
        "start": 168104561,
        "end": 168104562
    },
    "type": "SequenceLocation",
    "start": 168104561,
    "end": 168104562,
        "refgetAccession": "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO"
}

We'd be storing the bytes for "SequenceLocation", 168104561, 168104562, and "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO" twice.
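A back-of-the-envelope estimate of that per-row duplication for the example record above (raw text byte lengths only; a typed integer column would store roughly 8 bytes per number instead, and real storage adds per-column framing and encoding overhead):

```python
# Values from the example that would be stored both in the blob and in
# the manufactured index columns. Numbers measured as text for simplicity.
duplicated = [
    "SequenceLocation",
    "168104561",
    "168104562",
    "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO",
]
overhead = sum(len(v.encode("utf-8")) for v in duplicated)
print(overhead)  # bytes duplicated per row for this record
```

So on the order of tens of extra bytes per row for this record, before index structures themselves are counted.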

@theferrit32
Collaborator Author

Though it would also introduce a huge number of nulled fields in rows that don't contain a value because it's not relevant to that type. I'm not sure what the storage overhead of that is, but there's probably some.

No branches or pull requests

2 participants