Evidence Aggregator

The Evidence Aggregator is a collaboration between the Broad Institute, Microsoft Research Health Futures, and the Centre for Population Genomics.

This project leverages Generative AI to drastically accelerate targeted information retrieval and reasoning over large text knowledge bases (e.g. PubMed). We focus here on rare disease diagnostics, where retrieval of relevant, current information from the scientific literature is particularly challenging. The Evidence Aggregator is an analytic Copilot that can query the entirety of PubMed and highlight relevant literature for a given gene of interest, detailing the genetic variants described, the patients they affect, and their phenotypic consequences. The approach scales readily to other applications, and the modular codebase is designed to support this.

Using Evidence Aggregator

Evidence Aggregator functionality is exposed as a library of independent pipeline components, implemented as Python functions or classes, for querying, filtering, and aggregating genetic variant publication evidence. These components are designed to be assembled in various ways into a pipeline "app" that produces a concrete analysis or output when run - for example, taking a set of gene names as input and outputting a table of PubMed publication data referencing variants in those genes. The repo contains a number of existing pipeline definitions that can be run as-is or modified/reconfigured to produce different results.

Pipeline apps

An "app" is any Python class that implements the lib.evagg.IEvAggApp protocol (effectively an execute method) to be instantiated and run via the run_evagg_app Linux command-line entrypoint. An app is defined in a yaml specification file as an ordered dictionary of key/value pairs describing the full class/parameter component hierarchy to be instantiated before execute is called. The abstract format of a yaml spec is as follows:

# Reusable resource definitions.
resource_1: <value or spec sub-dictionary>
...
resource_n: <value or spec sub-dictionary>

# Application definition.
di_factory: <fully-qualified class name, factory method, or yaml file path>
param_1: <value or spec sub-dictionary>
...
param_n: <value or spec sub-dictionary>
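To make the app side of this concrete, the following is a minimal sketch of a class that would satisfy an execute-only protocol and receive its components as constructor parameters. Everything named here (AppLike, FakeLibrary, TableApp, and the assumption that execute takes no arguments and returns nothing) is hypothetical and chosen for illustration; the actual lib.evagg.IEvAggApp definition and the real component interfaces may differ.

```python
from typing import Dict, List, Protocol, runtime_checkable


@runtime_checkable
class AppLike(Protocol):
    """Assumed shape of the app protocol: a single no-argument execute method."""

    def execute(self) -> None: ...


class FakeLibrary:
    """Hypothetical query component returning fabricated records (no network access)."""

    def search(self, gene: str) -> List[Dict[str, str]]:
        return [{"gene": gene, "paper_id": f"fake-{gene.lower()}-1"}]


class TableApp:
    """Hypothetical app: queries each gene and collects one row per record."""

    def __init__(self, genes: List[str], library: FakeLibrary) -> None:
        # Constructor parameters play the role of param_1..param_n in a spec.
        self._genes = genes
        self._library = library
        self.rows: List[Dict[str, str]] = []

    def execute(self) -> None:
        for gene in self._genes:
            self.rows.extend(self._library.search(gene))
        # Emit a simple tab-separated table to standard output.
        for row in self.rows:
            print(f"{row['gene']}\t{row['paper_id']}")


app = TableApp(genes=["BRCA2", "TP53"], library=FakeLibrary())
assert isinstance(app, AppLike)  # structural check: the class has an execute method
app.execute()
```

In a spec file, a class like TableApp would be the di_factory target, with genes and library supplied as parameters; library could itself be a sub-dictionary with its own di_factory.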

The lib.di module is responsible for parsing, instantiating, and returning an arbitrary app object from an app spec when run_evagg_app is called. For a given spec, it first collects, in file order, the top-level name/value pairs - "resources" are entries occurring before the di_factory key, and "parameters" are those occurring after. It then resolves any string value of the form "{{resource_name}}" by looking resource_name up in the "resources" collection - this allows instantiation of singleton objects earlier in the spec that may be reused at multiple places later in the spec. Finally, the di module instantiates and returns the top-level app object represented by the spec by invoking the di_factory: entrypoint value, passing in the resolved/collected parameters as named keyword arguments. If any resource or parameter value in the spec consists of a sub-dictionary with its own di_factory key, that value is first resolved to a runtime object, recursively, following the same mechanism just described for the top-level spec. Object hierarchies of this sort may be arbitrarily deep.
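The resolution steps just described can be sketched in a few lines of Python. This is not the lib.di implementation, only an illustration under stated assumptions: the spec is already loaded into an ordered dict, di_factory is always a dotted "module.attribute" name (yaml file paths and other factory forms are not handled), and nested specs can see resources defined by their enclosing spec. The names build, resolve_value, and load_factory are hypothetical, and types.SimpleNamespace stands in for real component classes because it accepts arbitrary keyword arguments.

```python
import importlib
from typing import Any, Dict

FACTORY_KEY = "di_factory"


def load_factory(name: str) -> Any:
    """Import a callable from a dotted 'module.attribute' name."""
    module_name, _, attr = name.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


def resolve_value(value: Any, resources: Dict[str, Any]) -> Any:
    """Resolve '{{name}}' references and nested specs; pass other values through."""
    if isinstance(value, str) and value.startswith("{{") and value.endswith("}}"):
        return resources[value[2:-2].strip()]
    if isinstance(value, dict) and FACTORY_KEY in value:
        return build(value, resources)
    return value


def build(spec: Dict[str, Any], resources: Dict[str, Any]) -> Any:
    """Instantiate the object described by spec, walking its keys in file order."""
    scope = dict(resources)  # assumption: nested specs inherit outer resources
    params: Dict[str, Any] = {}
    factory = None
    for key, value in spec.items():
        if key == FACTORY_KEY:
            factory = load_factory(value)
        elif factory is None:
            scope[key] = resolve_value(value, scope)   # entry before di_factory: resource
        else:
            params[key] = resolve_value(value, scope)  # entry after di_factory: parameter
    if factory is None:
        raise ValueError(f"spec has no {FACTORY_KEY} key")
    return factory(**params)


# Hypothetical spec, as a yaml loader might return it (dicts preserve key order).
spec = {
    "logger": {"di_factory": "types.SimpleNamespace", "level": "INFO"},
    "di_factory": "types.SimpleNamespace",
    "log": "{{logger}}",
    "fetcher": {"di_factory": "types.SimpleNamespace", "log": "{{logger}}", "retries": 3},
}
app = build(spec, {})
```

Because logger is built once as a resource and then looked up by name, app.log and app.fetcher.log refer to the same object, which is the singleton reuse the paragraph above describes.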

The existing app/sub-object spec files for defining various runnable pipelines are found in lib/config/.

Quickstart

The following setup steps will allow you to run a simple pipeline app at the Linux command-line that outputs (fabricated) sample results without relying on any external resources. (SETUP.md describes each step in greater detail, along with the additional configuration needed for full pipeline functionality.)

Install software prerequisites: python, git, miniconda/libmamba
Clone this repository: git clone https://github.com/microsoft/healthfutures-evagg && cd healthfutures-evagg
Build a conda environment: conda env create -f environment.yml && conda activate evagg
Install poetry dependencies: poetry install

Then run the sample pipeline using the following command. It will output a few lines of placeholder publication "evidence" to standard output.

run_evagg_app lib/config/sample_config.yaml

Proceed to SETUP.md to set up external dependencies and perform a full-featured example execution of the pipeline against live resources.

Contributing

See CONTRIBUTING.md for additional detail on guidelines for contribution.

Code organization

The repository contains the following subdirectories:

root
|-- data: sample and reference data
|-- lib: source code for scripts and core libraries
|-- scripts: helper scripts for pre- and post-processing of results
|-- test: pytest unit tests for core libraries
|-- .out [generated]: default root directory for pipeline run logging and output
|-- .ref [generated]: default root directory for localized pipeline resources

Pre-PR checks

Before submitting any PR for review, please verify that linting checks pass (make lint) and that tests pass with acceptable coverage for any new code (make test). All pre-PR checks can be run in a single command via make ci.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Responsible AI Transparency

Use of this code

The Evidence Aggregator is intended to be one tool within a genomic analyst's toolkit to review literature related to a variant of interest. It is the user's responsibility to verify the accuracy of the information returned by the Evidence Aggregator. The Evidence Aggregator is for research use only. Users must adhere to the Microsoft Generative AI Services Code of Conduct.

The Evidence Aggregator is not designed, intended, or made available for use in the diagnosis, prevention, mitigation, or treatment of a disease or medical condition nor to perform any medical function and the performance of the Evidence Aggregator for such purposes has not been established. You bear sole responsibility for any use of the Evidence Aggregator, including incorporation into any product intended for a medical purpose.

Limitations

The Evidence Aggregator's literature discovery is limited to open-access publications with permissive licenses from the PubMed Central (PMC) Open Access Subset of journal articles. Information returned by the Evidence Aggregator should not be considered exhaustive.

Performance was optimized not for genes with extensive evidence for definitive gene-disease relationships, but for genes with moderate, limited, or no known gene-disease relationship, as annotated by the Gene Curation Coalition (GenCC), http://www.thegencc.org [July 2024].

The Evidence Aggregator uses the capabilities of generative AI for both publication foraging and information summarization. Performance of the Evidence Aggregator is limited to the capabilities of the underlying model.

The design and assessment of the Evidence Aggregator were conducted in English. At present, the Evidence Aggregator is limited to processing inputs and generating outputs in the English language.

Attributions

National Library of Medicine (US), National Center for Biotechnology Information; [1988]/[cited September 2024]. https://www.ncbi.nlm.nih.gov/

Harrison et al., Ensembl 2024, Nucleic Acids Research, 2024, 52(D1):D891–D899. PMID: 37953337. https://doi.org/10.1093/nar/gkad1049

Lefter M et al. (2021). Mutalyzer 2: Next Generation HGVS Nomenclature Checker. Bioinformatics, 2021 Sep 15; 37(28):2811-7

The Evidence Aggregator uses the Human Phenotype Ontology (HPO version dependent on user environment build and pipeline execution date/time). http://www.human-phenotype-ontology.org

The Evidence Aggregator team thanks the Gene Curation Coalition (GenCC) for providing curated content referenced during development. GenCC’s curated content was obtained at http://www.thegencc.org [July 2024] and includes contributions from the following organizations: ClinGen, Ambry Genetics, Franklin by Genoox, G2P, Genomics England PanelApp, Illumina, Invitae, King Faisal Specialist Hospital and Research Center, Laboratory for Molecular Medicine, Myriad Women’s Health, Orphanet, PanelApp Australia.

Environment dependencies may be found in environment.yml.
