Abstract
Motivation
Genome-wide association studies can reveal important genotype–phenotype associations; however, data quality and interpretability issues must be addressed. For drug discovery scientists seeking to prioritize targets based on the available evidence, these issues go beyond the single study.
Results
Here, we describe rational ranking, filtering and interpretation of inferred gene–trait associations and data aggregation across studies by leveraging existing curation and harmonization efforts. Each gene–trait association is evaluated for confidence, with scores derived solely from aggregated statistics, linking a protein-coding gene and phenotype. We propose a method for assessing confidence in gene–trait associations from evidence aggregated across studies, including a bibliometric assessment of scientific consensus based on the iCite relative citation ratio, and meanRank scores, to aggregate multivariate evidence.
This method, intended for drug target hypothesis generation, scoring and ranking, has been implemented as an analytical pipeline, available as open source, with public datasets of results, and a web application designed for usability by drug discovery scientists.
Availability and implementation
Web application, datasets and source code via https://unmtid-shinyapps.net/tiga/.
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
Over the two decades since the first draft human genome was published, dramatic progress has been achieved in foundational biology with translational benefits to medicine and human health. Genome-wide association studies (GWAS) contribute to this progress by inferring associations between genomic variations and phenotypic traits (Bossé and Amos, 2018; Rusu et al., 2017). These associations are correlations that may or may not be causal. While GWAS can reveal important genotype–phenotype associations, data quality and interpretability must be addressed (Gallagher and Chen-Plotkin, 2018; Lambert and Black, 2012; Marigorta et al., 2018; Visscher et al., 2017). For drug discovery scientists seeking to prioritize targets based on evidence from multiple studies, quality and interpretability issues are broader than for GWAS specialists. For this use case, GWAS are one of several evidence sources to be explored and considered, and interpretability must be in terms of genes corresponding to plausible targets, and traits corresponding to diseases of interest.
Single-nucleotide variants (SNVs) are the fundamental unit of genomic variation, and the term single-nucleotide polymorphism (SNP) refers to SNVs identified as common sites of variation relative to a reference genome, and measured by microarray or sequencing technologies. The NHGRI-EBI GWAS Catalog (Buniello et al., 2019), hereafter ‘Catalog’, curates associations between SNPs and traits from GWAS publications, shares metadata and summary data, standardizes heterogeneous submissions, maps formats and harmonizes content, mitigating widespread data and metadata issues according to FAIR (Findable, Accessible, Interoperable and Reusable) principles (Wilkinson et al., 2016). These challenges are exacerbated by rapid advances in experimental and computational methodology. As de facto GWAS registrar, the Catalog interacts directly with investigators and accepts submissions of summary statistic data in advance of publication. By proposing and maintaining metadata standards, the Catalog advocates and advances FAIRness in GWAS, for the benefit of the community. The Catalog addresses many difficulties due to content and format heterogeneity, but difficulties and limitations persist, owing both to the lack of reporting standards and to the variability of experimental methodology and diagnostic criteria.
Other GWAS data collections include the Genome-Wide Repository of Associations between SNPs and phenotypes, GRASP (Eicher et al., 2015) and The Framingham Heart Study, which employs non-standard phenotypes and some content from the Catalog (not updated since 2015). GWASdb (Li et al., 2016) integrates over 40 data sources in addition to the Catalog, includes less significant variants to address a variety of use cases, and has been maintained continually since 2011. GWAS Central, continually updated through 2019, includes less significant associations and provides tools for a variety of exploration modes based on Catalog data, but is not freely available for download. PheGenI (Ramos et al., 2014) integrates Catalog data with other NCBI datasets and tools. Others integrate GWAS with additional data [e.g. pathways, expression, linkage disequilibrium (LD)] to associate traits or diseases with genes (Greene et al., 2015; Li et al., 2018; Pallejà et al., 2012; Shen et al., 2017; Wainberg et al., 2019). Each of these resources offers unique value and features. For this use case, the Catalog is the logical choice, given its applicability and commitment to expert curation, data standards, support and maintenance.
Here, we describe TIGA (Target Illumination GWAS Analytics), an application for illuminating understudied drug targets. TIGA enables ranking, filtering and interpretation of inferred gene–trait associations aggregated across studies from the Catalog. Each inferred gene-to-trait association is evaluated for confidence, with scores derived solely from evidence aggregated across studies, linking a phenotypic trait and protein-coding gene, mapped from SNP variation. TIGA uses the relative citation ratio, RCR (Hutchins et al., 2016), a bibliometric statistic from iCite (Hutchins et al., 2019). TIGA does not index the full corpus of GWAS associations, but focuses on the strongest associations at the protein-coding gene level instead, filtered by disease areas that are relevant to drug discovery. For instance, GWAS for highly polygenic traits are considered less likely to illuminate druggable genes. Here, we describe the web application and its interpretability for non-GWAS specialists. We discuss TIGA as an application of data science for scientific consensus and interpretability, including statistical and semantical challenges. Code and data are available under BSD-2-Clause license from https://github.com/unmtransinfo/tiga-gwas-explorer.
2 Materials and methods
2.1 NHGRI-EBI GWAS Catalog preprocessing
The February 12, 2021, release of the Catalog references 11 671 studies and 4865 PubMed IDs. The curated associations include 8235 studies and 2706 EFO (experimental factor ontology) mapped traits. After filtering studies to require (i) EFO-mapped trait; (ii) P-value < 5 × 10⁻⁸; (iii) reported effect size (odds ratio or beta); and (iv) mapped protein-coding gene, we obtained 4118 studies, 1521 traits and 10 264 genes. For consistency between studies, only genes mapped by the Ensembl pipeline (The NHGRI-EBI GWAS Catalog FAQ) for genomics annotations were considered (not author-reported). Figures 1 and 2 illustrate the growth of GWAS research as measured by counts of studies and subjects.
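The four filtering criteria can be sketched as a simple predicate over association records. This is a minimal illustration, not the TIGA pipeline itself: the `Association` record layout and the study accessions are hypothetical, and the real Catalog files use different column names.

```python
from dataclasses import dataclass
from typing import Optional

GENOME_WIDE_P = 5e-8  # significance threshold stated in the text


@dataclass
class Association:
    # Hypothetical record layout, for illustration only.
    study: str
    efo_trait: Optional[str]      # EFO-mapped trait ID, None if unmapped
    p_value: float
    effect_size: Optional[float]  # odds ratio or beta, None if unreported
    gene: Optional[str]           # Ensembl-mapped protein-coding gene, None if unmapped


def passes_filters(a: Association) -> bool:
    """Apply criteria (i)-(iv) from the text."""
    return (
        a.efo_trait is not None          # (i) EFO-mapped trait
        and a.p_value < GENOME_WIDE_P    # (ii) genome-wide significance
        and a.effect_size is not None    # (iii) reported effect size
        and a.gene is not None           # (iv) mapped protein-coding gene
    )


# Toy records; accessions and values are invented.
records = [
    Association("STUDY_A", "EFO_0004541", 3e-12, 1.25, "ENSG00000075073"),
    Association("STUDY_B", "EFO_0004541", 1e-6, 1.10, "ENSG00000075073"),   # fails (ii)
    Association("STUDY_C", None, 2e-9, 0.80, "ENSG00000075073"),            # fails (i)
    Association("STUDY_D", "EFO_0004541", 4e-10, None, "ENSG00000075073"),  # fails (iii)
]
kept = [a.study for a in records if passes_filters(a)]
print(kept)
```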
2.2 RCRAS = RCR aggregated score
The purpose of TIGA is to evaluate the evidence for a gene–trait association, by aggregating multiple studies and their corresponding publications. The iCite RCR (Hutchins et al., 2016) is a bibliometric statistic designed to evaluate the impact of an individual publication (in contrast to the journal impact factor). By field- and time-normalizing per-publication citation counts, the RCR measures evolving impact, in effect a proxy for scientific consensus. Hence by aggregating RCRs we seek a corresponding measure of scientific consensus for associations—see (1).
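$$\mathrm{RCRAS} = \sum_{study} \frac{1}{gc_{study}} \sum_{pub} \frac{\log_2(\mathrm{RCR}_{pub} + 1)}{sc_{pub}} \tag{1}$$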
where study = GWAS (study accession), gc = gene count (in study), pub = publication (PubMed ID) and sc = study count (in publication).
The log2() function is used with the assertion that differences of evidence depend on relative, rather than absolute differences in RCR. Division by sc effects a partial count for publications associated with multiple studies. Since RCR ≥ 0, log2(RCR + 1) ≥ 0 and intuitively, when RCR = 1 and sc = 1, log2(RCR + 1) = 1. Similarly, division by gc reflects a partial count since studies may implicate multiple genes. This approach is informed by bibliometric methodology, including fractional publication counts, as described elsewhere (Cannon et al., 2017). For recent publications lacking RCR, we used the global median as an estimated prior. Computed thus, RCRAS extends RCR with similar logic, providing a rational bibliometric measure of evidence for scoring and ranking gene–trait associations.
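As a worked illustration of this aggregation, the following sketch assumes that RCRAS sums log2(RCR + 1)/sc over the publications of each study, divides by that study's gene count gc, and sums over studies. The input layout and the function name `rcras` are hypothetical.

```python
from math import log2


def rcras(studies):
    """Aggregate RCR evidence for one gene-trait association.

    studies: iterable of (gc, pubs) pairs, where gc is the gene count of the
    study and pubs is a list of (rcr, sc) pairs, with sc the number of studies
    sharing that publication. This layout is an assumption for illustration.
    """
    total = 0.0
    for gc, pubs in studies:
        # Fractional publication credit: each publication is split across its studies.
        pub_sum = sum(log2(rcr + 1) / sc for rcr, sc in pubs)
        # Fractional study credit: each study is split across its implicated genes.
        total += pub_sum / gc
    return total


example = [
    (1, [(1.0, 1)]),  # RCR = 1, sc = 1, gc = 1 contributes log2(2) = 1
    (2, [(3.0, 2)]),  # RCR = 3, sc = 2, gc = 2 contributes log2(4) / 2 / 2 = 0.5
]
score = rcras(example)
print(score)
```

The first toy study reproduces the intuitive case from the text: a single publication with RCR = 1, one study and one gene contributes exactly 1.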
2.3 Association weighting by SNP–gene distance
Mapping genomic variation of single nucleotides (SNPs) to genes is a challenging area of active research (Lamparter et al., 2016; Liu et al., 2010; Mishra and Macgregor, 2015). This project does not contribute to mapping methodology. Rather, TIGA employs mappings provided by the Catalog between GWAS SNPs and genes, generated by an Ensembl pipeline that ‘adds additional SNP-specific information associated with the rsID extracted … This information is retrieved using the Ensembl API and the source of the data is both Ensembl and NCBI’ (The NHGRI-EBI GWAS Catalog GWAS Catalog Curation). It is important to note that this method is unbiased and derived from experimental data and the current human reference genome. TIGA aggregates SNP-trait associations, assessing evidence for gene–trait associations, based on these understandings:
- SNPs within a gene are more strongly associated than SNPs upstream or downstream.
- Strength of association decreases with distance, or more rigorously stated, the probability of LD between an SNP and protein-coding gene decreases with genomic physical distance. Accordingly, we employ an inverse exponential scoring function, consistent with LD measure (Δ) and coefficient of decay (β) by Wang et al. (2006).
This function, used to weight N_snp to compute a distance-weighted SNP count N_snpw, is plotted together with the observed frequencies of mapped gene distances in Supplementary Figure S1, to illustrate how the extant evidence is weighted—see (2).
$$N\_snpw = \sum_{snp} 2^{-d/k} \tag{2}$$
where d = distance in base pairs and k = ‘half-life distance’ (50 000 bp).
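The weighting can be illustrated with a short sketch. It assumes the inverse exponential takes the base-2 form implied by the term ‘half-life distance’, so that the weight halves for every k base pairs; the function names are hypothetical.

```python
HALF_LIFE_BP = 50_000  # k, the 'half-life distance' from the text


def snp_weight(d_bp, k=HALF_LIFE_BP):
    """Weight of one SNP at distance d_bp from the gene (0 for intragenic SNPs)."""
    return 2.0 ** (-d_bp / k)


def n_snpw(distances_bp, k=HALF_LIFE_BP):
    """Distance-weighted SNP count for one gene-trait association."""
    return sum(snp_weight(d, k) for d in distances_bp)


# One intragenic SNP, one at 50 kb and one at 100 kb.
weights = [snp_weight(d) for d in (0, 50_000, 100_000)]
total = n_snpw([0, 50_000, 100_000])
print(weights, total)
```

Under this assumption an intragenic SNP counts fully, a SNP at 50 kb counts as one half, and a SNP at 100 kb counts as one quarter, so the three SNPs together give a weighted count of 1.75 rather than 3.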
2.4 Multivariate ranking
Multivariate ranking is a well-studied problem which needs to be addressed for ranking GWAS associations. We evaluated two approaches, namely non-parametric μ scores (Wittkowski and Song, 2010) and meanRank, and chose the latter based on benchmark test performance. meanRank aggregates ranks instead of variables directly, avoiding the need for ad hoc parameters. Variable ties imply rank ties, with missing data ranked last. We normalize scoring to (0,100] defining meanRankScore as follows.
Variables of merit used for scoring and ranking gene–trait associations:
- N_snpw: N_snp weighted by distance inverse exponential described above.
- pVal_mLog: max[-Log(pValue)] supporting gene–trait association.
- RCRAS: RCR aggregated score (iCite RCR-based), described above.
Variables of merit and interest not currently used for ranking:
- OR: median (odds ratio, inverted if <1) supporting gene–trait association.
- N_beta: simple count of beta values with 95% confidence intervals supporting gene–trait association.
- N_snp: SNPs involved with gene–trait association.
- N_study: studies supporting gene–trait association.
- study_N: mean (SAMPLE_SIZE) supporting gene–trait association.
- geneNtrait: total traits associated with the gene.
- traitNgene: total genes associated with the trait.
N_snp, N_study, geneNtrait and traitNgene are counts of the corresponding unique entities. From the variables selected via benchmark testing the meanRankScore is computed using (3):
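$$\mathrm{meanRankScore} = 100 - \mathrm{Percentile}\left(\frac{1}{N}\sum_{i=1}^{N} rank_i\right) \tag{3}$$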
where ranki = rank of ith variable and N = number of variables considered.
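A minimal sketch of rank aggregation follows. Several details are assumptions made for illustration rather than taken from the TIGA source code: rank 1 is the largest value, tied values share the lowest applicable rank, missing values (None) are ranked last, and the percentile of a case is the percentage of cases with a strictly smaller mean rank, which keeps the score within (0,100].

```python
def ranks_desc(values):
    """Rank 1 = largest value; ties share the lowest rank; None ranks last."""
    present = sorted((v for v in values if v is not None), reverse=True)
    last = len(present) + 1
    return [present.index(v) + 1 if v is not None else last for v in values]


def mean_rank_scores(table):
    """table: dict mapping variable name -> list of values, one per case."""
    n_cases = len(next(iter(table.values())))
    per_var = [ranks_desc(col) for col in table.values()]
    mean_rank = [sum(r[i] for r in per_var) / len(per_var) for i in range(n_cases)]
    scores = []
    for mr in mean_rank:
        # Percentile = percentage of cases with a strictly better (smaller) mean rank.
        pct = 100.0 * sum(other < mr for other in mean_rank) / n_cases
        scores.append(100.0 - pct)
    return mean_rank, scores


# Toy values for four genes associated with one trait (invented numbers).
toy = {
    "N_snpw":    [3.0, 1.5, 0.5, None],
    "pVal_mLog": [12.0, 30.0, 9.0, 8.5],
    "RCRAS":     [2.0, 2.0, 0.4, 0.1],
}
mean_rank, scores = mean_rank_scores(toy)
print(mean_rank, scores)
```

Because only ranks enter the mean, no variable-specific scaling or weighting parameters are needed, which is the property the text highlights.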
μ scores were implemented via the muStat R package (Wittkowski and Song, 2010). Vectors of ordinal variables represent each case, and non-dominated solutions are cases which are not inferior to any other case at any variable. (For TIGA, cases are genes or traits, corresponding with trait queries or gene queries, respectively, and their variables of merit described above.) The set of all non-dominated solutions defines a Pareto boundary. A μ score is defined simply as the number of lower cases minus the number of higher, but the ranking is the useful result. The ranking rule between case k and case k′ may be formalized as in (4). Simply put, case k′ is higher than case k if it is higher in some variable(s) and lower in none.
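$$x_k < x_{k'} \iff \left(\forall\, l \in \{1,\dots,L\}:\ x_{kl} \le x_{k'l}\right) \wedge \left(\exists\, l \in \{1,\dots,L\}:\ x_{kl} < x_{k'l}\right) \tag{4}$$
where x_kl = value of the lth variable for case k and L = number of variables.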
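The dominance rule and the resulting μ score can be sketched in a few lines. This is an illustrative pairwise implementation assuming complete data and larger-is-better variables; it is not the muStat code, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if case a is higher than case b: no variable lower, at least one higher."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def mu_scores(cases):
    """mu score = number of lower cases minus number of higher cases."""
    scores = []
    for a in cases:
        lower = sum(dominates(a, b) for b in cases)
        higher = sum(dominates(b, a) for b in cases)
        scores.append(lower - higher)
    return scores


# Toy cases as (N_snpw, pVal_mLog, RCRAS) tuples with invented values.
cases = [
    (3.0, 12.0, 2.0),
    (1.5, 30.0, 2.0),
    (0.5, 9.0, 0.4),
    (0.2, 8.5, 0.1),
]
mu = mu_scores(cases)
print(mu)
```

The first two cases are not comparable, since each is higher in one variable and lower in another, so both lie on the Pareto boundary and receive the same μ score. The all-pairs comparison makes this sketch quadratic in the number of cases.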
2.5 Benchmark against gold standard
Lacking a suitable gold standard set of gene–trait associations in general, we instead relied on established gene–disease associations from the genetics home reference (Fomous et al., 2006) and UniProtKB (The UniProt Consortium, 2018) databases. This gold standard set was built following a previously described approach (Pletscher-Frankild et al., 2015). It consists of 5366 manually curated associations (positive examples) between 3495 genes and 709 diseases. All other (2 472 589) possible pairings of these genes and diseases were considered negative examples.
To assess the quality of the TIGA gene–trait associations, we mapped the Ensembl gene IDs to STRING v11 identifiers using the STRING alias file (Szklarczyk et al., 2019) and the EFO terms to disease ontology identifiers (Schriml et al., 2019) based on ontology cross-references and the EMBL-EBI ontology Xref service. We then benchmarked each individual variable and multivariate ranking of the associations by constructing the receiver operating characteristic (ROC) curve from its agreement with the gold standard.
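The ROC construction can be illustrated as follows: associations are sorted by a score, the threshold is swept from high to low, and true and false positives are counted against gold standard labels. This is a generic sketch with invented scores and labels, not the benchmark code used for the results.

```python
def roc_points(scores, labels):
    """Sweep the score threshold from high to low; return (FPR, TPR) points.

    labels: 1 if the association is in the gold standard, else 0.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all associations tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]  # e.g. a variable of merit (toy values)
labels = [1, 1, 0, 1, 0, 0]              # agreement with the gold standard (toy values)
points = roc_points(scores, labels)
auc = trapezoid_auc(points)
print(points, auc)
```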
3 Results
3.1 The TIGA web application
TIGA facilitates drug target illumination by scoring and ranking associations between protein-coding genes and GWAS traits. While not capturing the entire Catalog, the TIGA app aggregates and filters GWAS findings for actionable intelligence, e.g. to enrich target prioritization via interactive plots and hitlists (Fig. 3), allowing users to identify the strongest associations supported by evidence.
Hits are ranked by meanRankScore described in Section 2. Scatterplot axes are effect (OR or N_beta) versus evidence as measured by meanRankScore. This app accepts ‘trait’ and ‘gene’ query parameters via URL, e.g. ?trait=EFO_0004541, ?gene=ENSG00000075073, ?trait=EFO_0004541&gene=ENSG00000075073. Gene markers are colored by target development level (TDL) (Oprea et al., 2018). TDL is a knowledge-based classification that bins human proteins into four ordinal and non-overlapping categories: Tclin, mechanism-of-action designated targets via which approved drugs act (Avram et al., 2020; Santos et al., 2017; Ursu et al., 2019); Tchem are proteins known to bind small molecules with high potency; Tbio includes proteins that have gene ontology (Ashburner et al., 2000) ‘leaf’ (lowest level) experimental terms; or meet two of these conditions: a fractional publication count (Pafilis et al., 2013) above 5, 3 or more gene ‘Reference Into Function’ annotations (Mitchell et al., 2003), or 50 or more commercial antibodies in Antibodypedia (Björling and Uhlén, 2008); Tdark are manually curated UniProtKB proteins that fail to place in any of the previous categories.
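Query URLs of the kind shown above can be built programmatically. The sketch below assumes that the query string is appended to the web application address given under Availability and implementation; the helper name `tiga_url` is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed base address, taken from the Availability section of this article.
BASE_URL = "https://unmtid-shinyapps.net/tiga/"


def tiga_url(trait=None, gene=None):
    """Build a TIGA query URL from an EFO trait ID and/or an Ensembl gene ID."""
    params = {}
    if trait:
        params["trait"] = trait
    if gene:
        params["gene"] = gene
    return BASE_URL + ("?" + urlencode(params) if params else "")


url = tiga_url(trait="EFO_0004541", gene="ENSG00000075073")
print(url)

# Round-trip check: parse the query parameters back out of the URL.
query = parse_qs(urlsplit(url).query)
print(query)
```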
3.2 Benchmark against gold standard disease–gene associations
To benchmark the quality of the GWAS associations in TIGA, we focused on the 383 EFO terms that could be mapped to diseases and their 20 458 associations with genes. We evaluated the performance of each variable of merit individually against the manually curated gold standard gene–disease associations.
The resulting ROC curves showed that the best performing variables are RCRAS, N_study, pVal_mLog, N_snpw and N_snp, which have areas under the curve (AUC) higher than 0.6 (Fig. 4A and Supplementary Fig. S1). The three variables RCRAS, pVal_mLog and N_snpw are furthermore complementary, having a maximal pairwise Spearman correlation of 0.325, whereas N_study and N_snp are strongly correlated with the better performing RCRAS and N_snpw, respectively. We thus used these three variables as the basis for calculating two multivariate rankings, namely μ score and meanRankScore. We benchmarked both rankings the same way as the individual variables and found that μ score performs marginally better than meanRankScore based on their AUC values (Fig. 4B). However, as the meanRankScore outperforms the μ score in the area of interest [0.0, 0.2] and is more than five orders of magnitude faster to calculate, we selected it as the final ranking in TIGA. Corresponding plots for the lower performing variables are provided in the Supplementary materials for completeness.
3.3 Using TIGA for drug target illumination
The main motivation for developing TIGA is to capture GWAS data when illuminating drug targets. Table 1 shows how many targets from each protein family and Illuminating the Druggable Genome (IDG) TDL are covered with associated traits in TIGA, with families as defined by drug target ontology (Lin et al., 2017) level 2. IDG TDL is a knowledge-based classification: Tclin = mechanism-of-action drug targets (Santos et al., 2017); Tchem = small-molecule modulators are known; Tbio = biological function elucidated; Tdark = understudied protein (Oprea et al., 2018). Coverage for the understudied 2469 Tdark proteins is of particular interest. However, the data for other TDLs can also provide unique and complementary evidence, especially in the case of Tbio proteins that are biologically characterized but have not been clinically validated.
Table 1.
Family/TDL | Tclin | Tchem | Tbio | Tdark | Total |
---|---|---|---|---|---|
G protein-coupled receptor | 73/101 | 78/143 | 73/129 | 110/407 | 334/780 |
Ion channel | 97/127 | 59/89 | 72/116 | 12/20 | 240/352 |
Kinase | 57/66 | 278/360 | 97/133 | 12/20 | 444/579 |
Calcium-binding protein | 3/5 | 1/3 | 58/93 | 8/11 | 70/112 |
Cell–cell junction | 0/0 | 0/0 | 22/49 | 8/12 | 30/61 |
Cell adhesion | 0/1 | 0/2 | 23/52 | 6/15 | 29/70 |
Cellular structure | 4/10 | 5/11 | 244/323 | 44/86 | 297/430 |
Chaperone | 0/1 | 8/9 | 27/46 | 6/8 | 41/64 |
Enzyme modulator | 4/5 | 25/44 | 376/532 | 50/101 | 455/682 |
Enzyme | 69/104 | 277/387 | 1022/1553 | 177/332 | 1545/2376 |
Epigenetic regulator | 9/13 | 41/55 | 16/22 | 0/1 | 66/91 |
Extracellular structure | 0/1 | 0/1 | 50/57 | 8/9 | 58/68 |
Immune response | 0/1 | 0/2 | 13/41 | 4/6 | 17/50 |
Nuclear receptor | 16/18 | 16/19 | 8/11 | 0/0 | 40/48 |
Nucleic acid binding | 0/1 | 13/19 | 354/603 | 67/131 | 434/754 |
Transcription factor | 1/2 | 12/16 | 385/557 | 73/163 | 471/738 |
Transporter | 31/37 | 63/82 | 405/605 | 105/160 | 604/884 |
Receptor | 20/24 | 6/12 | 157/225 | 27/55 | 210/316 |
Signaling | 13/24 | 24/32 | 245/338 | 17/34 | 299/428 |
Storage | 0/1 | 0/1 | 2/7 | 1/2 | 3/11 |
Surfactant | 0/0 | 0/0 | 3/5 | 0/0 | 3/5 |
Other | 95/134 | 233/337 | 3973/6131 | 1734/3416 | 6035/10 018 |
Total | 492/676 | 1139/1624 | 7625/11 628 | 2469/4989 | 11 725/18 917 |
Figures 3 and 5 illustrate a typical use case, the plot and gene list for trait ‘high-density lipoprotein cholesterol measurement’, which monitors blood levels of high-density lipoprotein cholesterol as a risk factor for heart disease. Figure 6 shows the provenance for one of the associated genes, GIMAP6 ‘GTPase IMAP family member 6’ with the scores and studies for this gene–trait association, including links to the Catalog and PubMed. GIMAP6 is an understudied (Tbio) member of the GTPases of immunity-associated protein family (GIMAP). Although literature-based evidence of its role in cholesterol homeostasis is scarce (Hoffmann et al., 2018; Richardson et al., 2020), this finding is substantiated by significantly increased circulating HDL cholesterol levels in GIMAP6-knock-out female mice (https://bit.ly/3uznvCU), suggesting that loss of GIMAP6 function may be linked with hypercholesterolemia-associated disorders.
Figures 7 and 8 illustrate another target illumination example, for trait ‘HbA1c measurement’ (glycated hemoglobin, signifying prolonged hyperglycemia), highly relevant to the management of type 2 diabetes mellitus (Rahbar et al., 1969; Saudek and Brick, 2009). Figure 8 shows the provenance for one of the associated genes, SLC25A44 ‘Solute carrier family 25 member 44’ with the scores and studies for this gene–trait association, including links to the Catalog and PubMed. SLC25A44 is an understudied (Tdark) branched-chain amino acid (BCAA) transporter that acts as a metabolic filter in brown adipose tissue, contributing to metabolic health (Yoneshiro et al., 2019) and may be involved in subcutaneous white adipose BCAA catabolism (Lee et al., 2021).
4 Discussion
4.1 Target illumination
The explicit goal of the NIH IDG program (Oprea et al., 2018) is to ‘map the knowledge gaps around proteins encoded by the human genome’. TIGA is fully aligned with this goal, as it evaluates the GWAS evidence for disease (trait)–gene associations. TIGA generates a GWAS-centric trait–gene association dataset using an automated, sustainable workflow amenable to integration into the Pharos portal (Nguyen et al., 2017; Sheils et al., 2021). The Open Targets platform (Ghoussaini et al., 2021; Ochoa et al., 2021) uses Catalog data and other sources, assisted by supervised machine learning, to identify probable causal genes, and validate therapeutic targets by aggregating and scoring disease–gene associations for ‘practicing biological scientists in the pharmaceutical industry and in academia’. Open Targets associations are enhanced, yet limited by the training data and knowledge sources reflecting current understandings of genetics. In contrast, TIGA provides aggregated evidence solely from the Catalog, reflecting experimental results with minimal bias, thus interpretable in terms of provenance and methodology, and more suitable for some downstream consumers and use cases.
4.2 From information to useful knowledge
In data-intensive fields such as genomics, specialized tools facilitate knowledge discovery, yet interpretation and integration can be problematic for non-specialists. Accordingly, this unmet need for integration and interpretation requires certain layers of abstraction and aggregation, which depend on specific use cases and objectives. Our target audience is drug discovery scientists for whom the aggregated findings of GWAS, appropriately interpreted, can provide additional value as they seek to prioritize targets. This clear purpose serves to focus and simplify all aspects of TIGA's design. Our approach for evidence aggregation is simple, easily comprehensible, and based on what may be regarded as axiomatic in science and rational inductive learning: First and foremost, evidence is measured by counting independent confirmatory results.
Interpretability concerns exist throughout science, but GWAS is understood to present particular challenges (Gallagher and Chen-Plotkin, 2018; Lambert and Black, 2012; Marigorta et al., 2018; Visscher et al., 2017). The main premise of GWAS is that genotype–phenotype correlations reveal underlying molecular mechanisms. While correlation does not imply causation, it contributes to plausibility of causation. Genomic dataset size adds difficulty. The standard GWAS P-value significance threshold is 5 × 10⁻⁸ based on overall P-value 0.05 and Bonferroni multiple testing adjustment for 1–10 million tests/SNPs (Marigorta et al., 2018). The statistical interpretation is that the family-wise error rate, or overall probability of a type 1 error, is 5%, but associations to mapped genes require additional interpretation. Motivated by, and despite these difficulties, it is our belief that GWAS data can be rationally interpreted and used by non-specialists, if suitably aggregated. Accordingly, TIGA is a rational way to suggest and rank research hypotheses, with the caveat that the identified signals may be accompanied by experimental noise and systematic uncertainty.
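The arithmetic behind the genome-wide threshold can be checked directly. The sketch below takes 1 million tests, the lower end of the range cited above, and additionally assumes independent tests when computing the family-wise error rate, which is a simplification for SNPs in LD.

```python
from math import isclose

alpha = 0.05         # overall (family-wise) significance level
n_tests = 1_000_000  # lower end of the 1-10 million tests/SNPs cited in the text

# Bonferroni adjustment: per-test threshold = alpha / number of tests.
threshold = alpha / n_tests
print(threshold)

# Family-wise error rate if all tests were independent and all nulls true
# (an assumption made here for illustration).
fwer = 1.0 - (1.0 - threshold) ** n_tests
print(fwer)

# At the upper end of the range the per-test threshold is ten times stricter.
stricter = alpha / 10_000_000
print(stricter)
```

Under these assumptions the per-test threshold is 5 × 10⁻⁸ and the family-wise error rate stays just below the 5% bound.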
4.3 Designing for downstream integration
Biomedical knowledge discovery depends on integration of sources and data types which are heterogeneous in the extreme, reflecting the underlying complexity of biomedical science. These challenges are increasingly understood and addressed by improving data science methodology. However, provenance, interpretability and confidence aspects are underappreciated and rarely discussed. As in all signal propagation, errors and uncertainty accrue and confidence decays. Here, we proposed the use of simple, transparent and comprehensible metrics to assess the relative confidence of disease–gene associations, via the unbiased meanRank scores. Figure 9, summarizing TIGA sources and interfaces, illustrates its well-defined role. Continuous confidence scores support algorithmic weighting and filtering. Standard identifiers and semantics support rigorous integration. By limiting provenance to the Catalog and its linked publications, TIGA enhances semantic interpretability.
5 Conclusions
We agree with Visscher et al. (2017) that ‘the paradigm of “one gene, one function, one trait” is the wrong way to view genetic variation’. Yet given the intrinsic complexity of biomedical science, progress often requires simplifying assumptions. Findings must be interpreted in context for an audience and application. Mindful of these concerns and limitations, TIGA provides a directly interpretable window into GWAS data, specifically for drug target hypothesis generation and elucidation. As interest in ‘interpretable machine learning’ and ‘explainable artificial intelligence’ (Gilpin et al., 2018) grows, TIGA summarizes gene–trait associations derived solely and transparently from GWAS summary- and metadata, with rational and intuitive evidence metrics and a robust, open-source pipeline designed for continual updates and improvements. Whether in stand-alone mode, or by integration with other interfaces, TIGA aims to contribute to drug target identification and prioritization.
6 Abbreviations and definitions
Common terms used in GWAS and related fields can vary in their definitions and connotations depending on context. Therefore, for clarity and rigor the following definitions are provided, which we consider consistent with best practices in the GWAS and drug discovery communities.
Term | Definition |
---|---|
Genotype | An organism has one genotype, comprised of a germ line genome and multiple somatic genomes. Statistical models may assume a population distribution hence a population genotype. |
Phenotype | An organism has one phenotype, comprised of (potentially) all non-genomic observable characteristics, a.k.a. phenotypic traits. |
Gene | Genomic unit responsible for an expression product. Protein-coding genes are a subset of this definition. |
Trait | Single non-genomic, observable characteristic. |
Drug target | Biomolecular entity involved in the mechanism of action of a drug. The IDG project is protein-centric; in the context of this work, all drug targets are proteins. |
Funding
This work was supported by US National Institutes of Health [U24 224370] for ‘Illuminating the Druggable Genome Knowledge Management Center’ (IDG KMC) and by the Novo Nordisk Foundation [NNF14CC0001].
Conflict of Interest: C.G.L. has a financial interest in Golden Helix Inc., a company which sells GWAS and other bioinformatics software. L.J.J. is one of the owners and Scientific Advisory Board members of Intomics A/S. T.I.O. has received honoraria or consulted for Abbott, AstraZeneca, Chiron, Genentech, Infinity Pharmaceuticals, Merz Pharmaceuticals, Merck Darmstadt, Mitsubishi Tanabe, Novartis, Ono Pharmaceuticals, Pfizer, Roche, Sanofi and Wyeth. He is on the Scientific Advisory Board of ChemDiv Inc. and InSilico Medicine. All other authors declared no conflict of interest.
Contributor Information
Jeremy J Yang, Division of Translational Informatics, Department of Internal Medicine, University of New Mexico Health Sciences Center, Albuquerque, NM 87131, USA; Integrative Data Science Laboratory, School of Informatics, Computing and Engineering, Indiana University, Bloomington, IN 47408, USA.
Dhouha Grissa, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen 2200, Denmark.
Christophe G Lambert, Division of Translational Informatics, Department of Internal Medicine, University of New Mexico Health Sciences Center, Albuquerque, NM 87131, USA.
Cristian G Bologa, Division of Translational Informatics, Department of Internal Medicine, University of New Mexico Health Sciences Center, Albuquerque, NM 87131, USA.
Stephen L Mathias, Division of Translational Informatics, Department of Internal Medicine, University of New Mexico Health Sciences Center, Albuquerque, NM 87131, USA.
Anna Waller, Department of Pathology, University of New Mexico Health Sciences Center, Albuquerque, NM 87131, USA.
David J Wild, Integrative Data Science Laboratory, School of Informatics, Computing and Engineering, Indiana University, Bloomington, IN 47408, USA.
Lars Juhl Jensen, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen 2200, Denmark.
Tudor I Oprea, Division of Translational Informatics, Department of Internal Medicine, University of New Mexico Health Sciences Center, Albuquerque, NM 87131, USA; Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen 2200, Denmark.
References
- Ashburner M. et al. (2000) Gene ontology: tool for the unification of biology. Nat. Genet., 25, 25–29.
- Avram S. et al. (2020) Novel drug targets in 2019. Nat. Rev. Drug Discov., 19, 300. [DOI] [PubMed] [Google Scholar]
- Björling E., Uhlén M. (2008) Antibodypedia, a portal for sharing antibody and antigen validation data. Mol. Cell. Proteomics, 7, 2028–2037. [DOI] [PubMed] [Google Scholar]
- Bossé Y., Amos C.I. (2018) A decade of GWAS results in lung cancer. Cancer Epidemiol. Biomarkers Prev., 27, 363–379. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Buniello A. et al. (2019) The NHGRI-EBI GWAS catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res., 47, D1005–D1012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cannon D.C. et al. (2017) TIN-X: target importance and novelty explorer. Bioinformatics, 33, 2601–2603. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Eicher J.D. et al. (2015) GRASP v2.0: an update on the genome-wide repository of associations between SNPs and phenotypes. Nucleic Acids Res., 43, D799–D804. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fomous C. et al. (2006) ‘Genetics home reference’: helping patients understand the role of genetics in health and disease. Community Genet., 9, 274–278. [DOI] [PubMed] [Google Scholar]
- Gallagher M.D., Chen-Plotkin A.S. (2018) The post-GWAS era: from association to function. Am. J. Hum. Genet., 102, 717–730. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ghoussaini M. et al. (2021) Open targets genetics: systematic identification of trait-associated genes using large-scale genetics and functional genomics. Nucleic Acids Res., 49, D1311–D1320. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gilpin L.H. et al. (2018) Explaining explanations: an overview of interpretability of machine learning. In: 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA). arXiv:1806.00069v3 [cs.AI].
- Greene C.S. et al. (2015) Understanding multicellular function and disease with human tissue-specific networks. Nat. Genet., 47, 569–576.
- Hoffmann T.J. et al. (2018) A large electronic-health-record-based genome-wide study of serum lipids. Nat. Genet., 50, 401–413.
- Hutchins B.I. et al. (2016) Relative citation ratio (RCR): a new metric that uses citation rates to measure influence at the article level. PLoS Biol., 14, e1002541.
- Hutchins B.I. et al. (2019) The NIH open citation collection: a public access, broad coverage resource. PLoS Biol., 17, e3000385.
- Lambert C.G., Black L.J. (2012) Learning from our GWAS mistakes: from experimental design to scientific method. Biostatistics, 13, 195–203.
- Lamparter D. et al. (2016) Fast and rigorous computation of gene and pathway scores from SNP-based summary statistics. PLoS Comput. Biol., 12, e1004714.
- Lee S. et al. (2021) Branched-chain amino acid metabolism, insulin sensitivity and liver fat response to exercise training in sedentary dysglycaemic and normoglycaemic men. Diabetologia, 64, 410–423.
- Li M.J. et al. (2016) GWASdb v2: an update database for human genetic variants identified by genome-wide association studies. Nucleic Acids Res., 44, D869–D876.
- Li T. et al. (2018) GeNets: a unified web platform for network-based genomic analyses. Nat. Methods, 15, 543–546.
- Lin Y. et al. (2017) Drug target ontology to classify and integrate drug discovery data. J. Biomed. Semantics, 8, 50.
- Liu J.Z. et al.; AMFS Investigators (2010) A versatile gene-based test for genome-wide association studies. Am. J. Hum. Genet., 87, 139–145.
- Marigorta U.M. et al. (2018) Replicability and prediction: lessons and challenges from GWAS. Trends Genet., 34, 504–517.
- Mishra A., Macgregor S. (2015) VEGAS2: software for more flexible gene-based testing. Twin Res. Hum. Genet., 18, 86–91.
- Mitchell J.A. et al. (2003) Gene indexing: characterization and analysis of NLM’s GeneRIFs. AMIA Annu. Symp. Proc., 2003, 460–464.
- Nguyen D.-T. et al. (2017) Pharos: collating protein information to shed light on the druggable genome. Nucleic Acids Res., 45, D995–D1002.
- Ochoa D. et al. (2021) Open targets platform: supporting systematic drug-target identification and prioritisation. Nucleic Acids Res., 49, D1302–D1310.
- Oprea T.I. et al. (2018) Unexplored therapeutic opportunities in the human genome. Nat. Rev. Drug Discov., 17, 377.
- Pafilis E. et al. (2013) The SPECIES and ORGANISMS resources for fast and accurate identification of taxonomic names in text. PLoS One, 8, e65390.
- Pallejà A. et al. (2012) DistiLD database: diseases and traits in linkage disequilibrium blocks. Nucleic Acids Res., 40, D1036–D1040.
- Pletscher-Frankild S. et al. (2015) DISEASES: text mining and data integration of disease–gene associations. Methods, 74, 83–89.
- Rahbar S. et al. (1969) Studies of an unusual hemoglobin in patients with diabetes mellitus. Biochem. Biophys. Res. Commun., 36, 838–843.
- Ramos E.M. et al. (2014) Phenotype-genotype integrator (PheGenI): synthesizing genome-wide association study (GWAS) data with existing genomic resources. Eur. J. Hum. Genet., 22, 144–147.
- Richardson T.G. et al. (2020) Evaluating the relationship between circulating lipoprotein lipids and apolipoproteins with risk of coronary heart disease: a multivariable Mendelian randomisation analysis. PLoS Med., 17, e1003062.
- Rusu V. et al.; SIGMA T2D Consortium (2017) Type 2 diabetes variants disrupt function of SLC16A11 through two distinct mechanisms. Cell, 170, 199–212.e20.
- Santos R. et al. (2017) A comprehensive map of molecular drug targets. Nat. Rev. Drug Discov., 16, 19–34.
- Saudek C.D., Brick J.C. (2009) The clinical use of hemoglobin A1c. J. Diabetes Sci. Technol., 3, 629–634.
- Schriml L.M. et al. (2019) Human disease ontology 2018 update: classification, content and workflow expansion. Nucleic Acids Res., 47, D955–D962.
- Sheils T.K. et al. (2021) TCRD and pharos 2021: mining the human proteome for disease biology. Nucleic Acids Res., 49, D1334–D1346.
- Shen J. et al. (2017) STOPGAP: a database for systematic target opportunity assessment by genetic association predictions. Bioinformatics, 33, 2784–2786.
- Szklarczyk D. et al. (2019) STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res., 47, D607–D613.
- The NHGRI-EBI GWAS Catalog GWAS Catalog Curation. GWAS Catalog. https://www.ebi.ac.uk/gwas/home (22 October 2020, date last accessed).
- The NHGRI-EBI GWAS Catalog FAQ. https://www.ebi.ac.uk/gwas/docs/faq (24 September 2019, date last accessed).
- The UniProt Consortium (2018) UniProt: the universal protein knowledgebase. Nucleic Acids Res., 46, 2699.
- Ursu O. et al. (2019) Novel drug targets in 2018. Nat. Rev. Drug Discov., 18, 328.
- Visscher P.M. et al. (2017) 10 Years of GWAS discovery: biology, function, and translation. Am. J. Hum. Genet., 101, 5–22.
- Wainberg M. et al. (2019) Opportunities and challenges for transcriptome-wide association studies. Nat. Genet., 51, 592–599.
- Wang Y. et al. (2006) A fine-scale linkage-disequilibrium measure based on length of haplotype sharing. Am. J. Hum. Genet., 78, 615–628.
- Wilkinson M.D. et al. (2016) The FAIR guiding principles for scientific data management and stewardship. Sci. Data, 3, 160018.
- Wittkowski K.M., Song T. (2010) Nonparametric methods for molecular biology. Methods Mol. Biol., 620, 105–153.
- Yoneshiro T. et al. (2019) BCAA catabolism in brown fat controls energy homeostasis through SLC25A44. Nature, 572, 614–619.