Description
Feature description
Right now we dump all objects into a key/value table called vrs_objects
which has the object id and a JSON blob. So far this has performed well in postgresql and duckdb, at least at small scales of hundreds of thousands of rows. This is unlikely to scale well up into the millions or hundreds of millions. Many queries we do (and foresee needing to do), using the current table schemas and query strings, will be doing full table scans, maybe multiple full table scans per query.
Postgresql supports indexing fields in JSONB columns directly:
https://www.postgresql.org/docs/current/datatype-json.html#JSON-INDEXING
DuckDB does not. We can create an additional column for each value from the JSON blob that we want to index. I'm not sure how much this will bloat the database storage size. It would involve doubling the byte size of every indexed value, and would involve creating null
columns for rows which don't contain the JSON field being indexed.
e.g. Allele
objects have a location: text
field, which we do want to index, but the Location
object those point to do not have a location
field.
Another downside is it is also not transparent indexing, meaning when we want to query based on a field that is indexed we cannot refer to the field in the JSON column, we have to refer specifically to our manufactured column which has that field's value indexed. This can be fine but if we start adding a ton of indexes it gets harder to keep track of.
DuckDB also has generated columns which at first glance seems like it'd help a lot with maintaining the index columns
https://duckdb.org/docs/stable/sql/statements/create_table.html#generated-columns
Another route we can take is splitting types into their own tables, and extracting important values into columns in those type-specific tables. Depending on how many fields we pull up into columns, this makes the anyvar
source code and table schemas more tightly coupled to a specific VRS schema.
Use case
lots of variants
Acceptance Criteria
performs well with lots of variants
Proposed solution
No response
Alternatives considered
No response
Implementation details
No response
Potential Impact
No response
Additional context
No response
Contribution
None