Investigate indexing common query fields (e.g. type, location ref/start/end, and ids) · Issue #140 · biocommons/anyvar
Feature description

Right now we dump all objects into a key/value table called `vrs_objects`, which has the object id and a JSON blob. So far this has performed well in PostgreSQL and DuckDB, at least at small scales of hundreds of thousands of rows. It is unlikely to scale well up into the millions or hundreds of millions of rows. Many queries we do (and foresee needing to do), using the current table schemas and query strings, will be doing full table scans, maybe multiple full table scans per query.

PostgreSQL supports indexing fields in JSONB columns directly:
https://www.postgresql.org/docs/current/datatype-json.html#JSON-INDEXING
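As a rough illustration of what that could look like on the current single-table layout (the `vrs_id`/`vrs_object` column names are assumptions, not the actual anyvar schema):

```sql
-- Hypothetical layout; the real anyvar column names may differ.
CREATE TABLE vrs_objects (
    vrs_id     TEXT PRIMARY KEY,
    vrs_object JSONB
);

-- GIN index: serves containment predicates over the whole blob.
CREATE INDEX vrs_objects_gin ON vrs_objects USING GIN (vrs_object);
SELECT vrs_id FROM vrs_objects WHERE vrs_object @> '{"type": "Allele"}';

-- B-tree expression index: serves equality lookups on one field.
CREATE INDEX vrs_objects_type ON vrs_objects ((vrs_object ->> 'type'));
SELECT vrs_id FROM vrs_objects WHERE vrs_object ->> 'type' = 'Allele';
```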
DuckDB does not. We can instead create an additional column for each value from the JSON blob that we want to index. I'm not sure how much this will bloat the database storage size: it would involve doubling the byte size of every indexed value, and creating `null` columns for rows that don't contain the JSON field being indexed.
e.g. `Allele` objects have a `location: text` field, which we do want to index, but the `Location` objects those point to do not have a `location` field.
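For concreteness, a manual version of that extracted column could look like the sketch below (the `location_id` column name is made up, and the `->>` extraction operator is available in both PostgreSQL and DuckDB's JSON extension):

```sql
-- Manually maintained extracted column (hypothetical names).
ALTER TABLE vrs_objects ADD COLUMN location_id TEXT;

-- Backfill; stays NULL for rows whose JSON has no "location" field.
UPDATE vrs_objects SET location_id = vrs_object ->> 'location';

CREATE INDEX vrs_objects_location_idx ON vrs_objects (location_id);
```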
Another downside is that this is not transparent indexing: when we want to query on an indexed field, we can't refer to the field in the JSON column, we have to refer specifically to our manufactured column that holds that field's value. This can be fine, but if we start adding a ton of indexes it gets harder to keep track of.

DuckDB also has generated columns, which at first glance seem like they'd help a lot with maintaining the index columns:
https://duckdb.org/docs/stable/sql/statements/create_table.html#generated-columns
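A minimal sketch of the generated-column variant, using the same hypothetical column names (and it's worth verifying that DuckDB can actually build an index over a virtual generated column):

```sql
-- DuckDB keeps a generated column in sync automatically.
CREATE TABLE vrs_objects (
    vrs_id      VARCHAR PRIMARY KEY,
    vrs_object  JSON,
    location_id VARCHAR GENERATED ALWAYS AS (vrs_object ->> 'location')
);

-- Whether this index is supported on a virtual column needs checking.
CREATE INDEX vrs_objects_location_idx ON vrs_objects (location_id);

-- Still not transparent: queries must name the generated column...
SELECT vrs_id FROM vrs_objects WHERE location_id = 'ga4gh:SL.placeholder';
-- ...not the JSON path, which would still scan the table:
-- SELECT vrs_id FROM vrs_objects WHERE vrs_object ->> 'location' = '...';
```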
Another route we could take is splitting types into their own tables, and extracting important values into columns in those type-specific tables. Depending on how many fields we pull up into columns, this makes the `anyvar` source code and table schemas more tightly coupled to a specific VRS schema.
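For example, type-specific tables might look something like this sketch; the column choices are hypothetical and tied to VRS 2.x field names:

```sql
-- Hypothetical type-specific tables (coupled to VRS 2.x fields).
CREATE TABLE alleles (
    id          TEXT PRIMARY KEY,  -- ga4gh identifier
    location_id TEXT,              -- points at locations.id
    payload     JSON               -- remaining fields kept as a blob
);

CREATE TABLE locations (
    id           TEXT PRIMARY KEY,
    sequence_ref TEXT,             -- e.g. refget accession
    start_pos    BIGINT,
    end_pos      BIGINT,
    payload      JSON
);

CREATE INDEX alleles_location_idx ON alleles (location_id);
CREATE INDEX locations_region_idx ON locations (sequence_ref, start_pos, end_pos);
```

Queries like "all Alleles within a region of a sequence" then become an indexed join rather than a full scan.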
Use case
lots of variants
Acceptance Criteria
performs well with lots of variants
Proposed solution
No response
Alternatives considered
No response
Implementation details
No response
Potential Impact
No response
Additional context
No response
Contribution
None
> We can create an additional column for each value from the JSON blob that we want to index. I'm not sure how much this will bloat the database storage size
Is it an issue that the JSON structure is gonna differ between variation types (and between SequenceLocation and RLE-based alleles)?
Though it would also introduce a huge number of nulled fields in rows that don't contain a value because it isn't relevant to that type. I'm not sure what the storage overhead of that is, but there's probably some.