8000 Investigate indexing common query fields (e.g. type, location ref/start/end, and ids) · Issue #140 · biocommons/anyvar · GitHub

Investigate indexing common query fields (e.g. type, location ref/start/end, and ids) #140


Open
theferrit32 opened this issue Apr 23, 2025 · 3 comments
Labels
enhancement New feature or request

Comments

@theferrit32
Collaborator

Feature description

Right now we dump all objects into a key/value table called vrs_objects, which has the object id and a JSON blob. So far this has performed well in PostgreSQL and DuckDB, at least at small scales of hundreds of thousands of rows, but it is unlikely to scale well into the millions or hundreds of millions. With the current table schemas and query strings, many queries we do (and foresee needing to do) will require full table scans, possibly multiple full table scans per query.

PostgreSQL supports indexing fields in JSONB columns directly:
https://www.postgresql.org/docs/current/datatype-json.html#JSON-INDEXING
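As a rough illustration of what an expression index over a JSON field buys us: the sketch below uses Python's stdlib sqlite3 (which also supports expression indexes over JSON) as a stand-in, since the same idea applies in PostgreSQL with `CREATE INDEX ... ON vrs_objects ((vrs_object ->> 'type'))`. The table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vrs_objects (vrs_id TEXT PRIMARY KEY, vrs_object TEXT)")
conn.execute(
    "INSERT INTO vrs_objects VALUES (?, ?)",
    ("ga4gh:SL.kpZezgkbDjQ9EJeNY8bpUeIHCfjN7AIq",
     '{"type": "SequenceLocation", "start": 168104561, "end": 168104562}'),
)

# Expression index over a field inside the JSON blob; queries that filter on
# the exact same expression can use it instead of scanning the whole table.
conn.execute(
    "CREATE INDEX idx_vrs_type ON vrs_objects (json_extract(vrs_object, '$.type'))"
)

# EXPLAIN QUERY PLAN shows whether the planner picks the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT vrs_id FROM vrs_objects "
    "WHERE json_extract(vrs_object, '$.type') = 'SequenceLocation'"
).fetchall()
print(plan)
```

The key caveat (which applies in Postgres too) is that the filter in the query has to match the indexed expression; an equivalent-but-differently-written expression falls back to a scan.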

DuckDB does not. Instead, we can create an additional column for each value from the JSON blob that we want to index. I'm not sure how much this would bloat the database storage size: it would double the byte size of every indexed value, and it would require null columns for rows that don't contain the JSON field being indexed.

e.g. Allele objects have a location: text field, which we do want to index, but the Location objects they point to do not have a location field.

Another downside is that this is not transparent indexing: when we want to query on an indexed field, we cannot refer to the field inside the JSON column; we have to refer specifically to our manufactured column that holds that field's value. This can be fine, but if we start adding a ton of indexes it gets harder to keep track of.

DuckDB also has generated columns, which at first glance seem like they'd help a lot with maintaining the index columns:
https://duckdb.org/docs/stable/sql/statements/create_table.html#generated-columns
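A sketch of the generated-column approach, again using sqlite3 as a testable stand-in (SQLite's `GENERATED ALWAYS AS ... VIRTUAL` syntax is similar in spirit to what the DuckDB docs describe, but the exact DuckDB syntax and storage behavior should be checked against the link above). Column names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# loc_start is derived from the JSON blob automatically on every insert;
# VIRTUAL means it is computed on read rather than stored per row, though
# building an index on it materializes the values in the index anyway.
conn.execute("""
    CREATE TABLE vrs_objects (
        vrs_id     TEXT PRIMARY KEY,
        vrs_object TEXT,
        loc_start  INTEGER GENERATED ALWAYS AS
                   (json_extract(vrs_object, '$.start')) VIRTUAL
    )
""")
conn.execute("CREATE INDEX idx_loc_start ON vrs_objects (loc_start)")

# Writers only ever touch vrs_id/vrs_object; the derived column tracks
# the blob for free, so there is no separate column to keep in sync.
conn.execute(
    "INSERT INTO vrs_objects (vrs_id, vrs_object) VALUES (?, ?)",
    ("ga4gh:SL.kpZezgkbDjQ9EJeNY8bpUeIHCfjN7AIq",
     '{"type": "SequenceLocation", "start": 168104561, "end": 168104562}'),
)
row = conn.execute(
    "SELECT vrs_id FROM vrs_objects WHERE loc_start = 168104561"
).fetchone()
print(row)
```

This keeps the manufactured columns from drifting out of sync with the blob, though queries still have to name the generated column rather than the JSON path.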

Another route we could take is splitting types into their own tables and extracting important values into columns in those type-specific tables. Depending on how many fields we pull up into columns, this makes the anyvar source code and table schemas more tightly coupled to a specific VRS schema.
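For comparison, a minimal sketch of the type-specific-table route, again via sqlite3. The table name, column names, and composite index are all hypothetical; the point is that the hot fields become real typed columns while the full object is kept alongside them.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table for SequenceLocation objects only: the fields we query
# on become typed, indexable columns; the blob is retained for round-tripping.
conn.execute("""
    CREATE TABLE sequence_locations (
        vrs_id           TEXT PRIMARY KEY,
        refget_accession TEXT,
        start            INTEGER,
        "end"            INTEGER,
        vrs_object       TEXT
    )
""")
conn.execute(
    'CREATE INDEX idx_loc_range ON sequence_locations (refget_accession, start, "end")'
)

obj = {
    "id": "ga4gh:SL.kpZezgkbDjQ9EJeNY8bpUeIHCfjN7AIq",
    "type": "SequenceLocation",
    "sequenceReference": {"type": "SequenceReference",
                          "refgetAccession": "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO"},
    "start": 168104561,
    "end": 168104562,
}
conn.execute(
    "INSERT INTO sequence_locations VALUES (?, ?, ?, ?, ?)",
    (obj["id"], obj["sequenceReference"]["refgetAccession"],
     obj["start"], obj["end"], json.dumps(obj)),
)

# A range-overlap query, which would be a full scan against the key/value table.
hits = conn.execute(
    "SELECT vrs_id FROM sequence_locations "
    'WHERE refget_accession = ? AND start < ? AND "end" > ?',
    ("SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO", 168104562, 168104561),
).fetchall()
print(hits)
```

The coupling cost is visible here: the INSERT has to know the VRS object layout, so any VRS schema change ripples into the table schema and the extraction code.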

Use case

lots of variants

Acceptance Criteria

performs well with lots of variants

Proposed solution

No response

Alternatives considered

No response

Implementation details

No response

Potential Impact

No response

Additional context

No response

Contribution

None

@theferrit32 theferrit32 added the enhancement New feature or request label Apr 23, 2025
@jsstevenson
Contributor

We can create an additional column for each value from the JSON blob that we want to index. I'm not sure how much this will bloat the database storage size

Is it an issue that the JSON structure is going to differ between variation types (and between SequenceLocation and RLE-based alleles)?

@theferrit32
Collaborator Author

I was thinking more about what happens if we take this record:

{
    "vrs_id": "ga4gh:SL.kpZezgkbDjQ9EJeNY8bpUeIHCfjN7AIq",
    "vrs_object": {
        "id": "ga4gh:SL.kpZezgkbDjQ9EJeNY8bpUeIHCfjN7AIq",
        "type": "SequenceLocation",
        "digest": "kpZezgkbDjQ9EJeNY8bpUeIHCfjN7AIq",
        "sequenceReference": {
            "type": "SequenceReference",
            "refgetAccession": "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO"
        },
        "start": 168104561,
        "end": 168104562
    }
}

And pull up some values into columns for the purpose of indexing them:

{
    "vrs_id": "ga4gh:SL.kpZezgkbDjQ9EJeNY8bpUeIHCfjN7AIq",
    "vrs_object": {
        "id": "ga4gh:SL.kpZezgkbDjQ9EJeNY8bpUeIHCfjN7AIq",
        "type": "SequenceLocation",
        "digest": "kpZezgkbDjQ9EJeNY8bpUeIHCfjN7AIq",
        "sequenceReference": {
            "type": "SequenceReference",
            "refgetAccession": "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO"
        },
        "start": 168104561,
        "end": 168104562
    },
    "type": "SequenceLocation",
    "start": 168104561,
    "end": 168104562,
        "refgetAccession": "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO"
}

We'd be storing the bytes for "SequenceLocation", 168104561, 168104562, and "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO" twice.
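A back-of-the-envelope estimate of that per-row duplication for the example record above (raw text byte lengths only; a typed integer column would store roughly 8 bytes per number instead, and real storage adds per-column framing and encoding overhead):

```python
# Values from the example that would be stored both in the blob and in
# the manufactured index columns. Numbers measured as text for simplicity.
duplicated = [
    "SequenceLocation",
    "168104561",
    "168104562",
    "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO",
]
overhead = sum(len(v.encode("utf-8")) for v in duplicated)
print(overhead)  # bytes duplicated per row for this record
```

So on the order of tens of extra bytes per row for this record, before index structures themselves are counted.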

@theferrit32
Collaborator Author

Though it would also introduce a huge number of nulled fields in rows that don't contain a value because it's not relevant to that type. I'm not sure what the storage overhead of that is, but there's probably some.

No branches or pull requests

2 participants