Software Tool Article

IsoAligner: dynamic mapping of amino acid positions across protein isoforms

[version 1; peer review: 2 approved with reservations]

Jacob Hanimann¹, Holger Moch¹, Martin Zoche¹, Abdullah Kahraman¹,²

PUBLISHED 31 Mar 2022

Author details

¹ Department of Pathology and Molecular Pathology, University Hospital Zurich, Zurich, Zurich, 8091, Switzerland
² Swiss Institute of Bioinformatics, Lausanne, Lausanne, 1015, Switzerland

Jacob Hanimann
Roles: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

Holger Moch
Roles: Funding Acquisition, Resources

Martin Zoche
Roles: Conceptualization, Funding Acquisition, Resources

Abdullah Kahraman
Roles: Conceptualization, Funding Acquisition, Methodology, Project Administration, Resources, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing


This article is included in the Bioinformatics gateway.

This article is included in the Cell & Molecular Biology gateway.

Abstract

Aligning protein isoform sequences is often performed in cancer diagnostics to homogenise mutation annotations from different diagnostic assays. However, most alignment tools are designed for homologous sequences, which often leads to alignments of non-identical exonic regions. Here, we present the interactive alignment web service IsoAligner for exact mapping of exonic protein subsequences. The tool uses a customized Needleman-Wunsch algorithm including an open gap penalty combined with a gene-specific minimal exon length function and dynamically adjustable parameters. As input, IsoAligner accepts either gene/transcript/protein IDs from different databases (Ensembl, UniProt, RefSeq) or raw amino acid sequences. The output of IsoAligner consists of pairwise alignments and a table of mapped amino acid positions between the canonical or supplied isoform IDs and all alternative isoforms. IsoAligner’s human isoform library comprises over 1.3 million IDs mapped on over 120,000 protein sequences. IsoAligner is a fast and interactive alignment tool for retrieving amino acid positions between different protein isoforms. Its application will allow diagnostic and precision medicine labs to detect inconsistent variant annotations between different assays and databases. Availability: This tool is available as a web service on www.isoaligner.org. A REST API is available for programmatic access. The source code for both services can be found at https://github.com/mtp-usz/IsoAligner.

Keywords

alignment, protein isoform, amino acid sequence, protein ids, exon-mapping, amino acid position, splice-variant

Corresponding author: Abdullah Kahraman

Competing interests: No competing interests were disclosed.

Grant information: The author(s) declared that no grants were involved in supporting this work.

Copyright: © 2022 Hanimann J et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Hanimann J, Moch H, Zoche M and Kahraman A. IsoAligner: dynamic mapping of amino acid positions across protein isoforms [version 1; peer review: 2 approved with reservations]. F1000Research 2022, 11:382 (https://doi.org/10.12688/f1000research.76154.1) First published: 31 Mar 2022, 11:382 (https://doi.org/10.12688/f1000research.76154.1) Latest published: 31 Mar 2022, 11:382 (https://doi.org/10.12688/f1000research.76154.1)

Introduction

Mapping isoform sequences to each other and identifying corresponding amino acids (AA) between isoforms is an important and prevalent task, especially in the interpretation of cancer mutations [Stephenson et al., 2019]. Most functional databases like COSMIC (RRID:SCR_002260), ClinVar (RRID:SCR_006169), or gnomAD (RRID:SCR_014964) use the longest protein isoform as a reference for their annotations. However, in cancer diagnostics, shorter splice variants with different amino acid positions can be chosen, leading to confusion about whether a mutation exists in the aforementioned functional databases. Many inconsistent variant annotations exist [Tsai et al., 2021]. For example, the MET p.D1246N resistance mutation arising in lung cancers with MET exon 14 skipping events is commonly annotated as p.D1228N in diagnostic assays. While the first is annotated on the 1408 AA long NM_001127500.3, the latter is annotated on the 18 AA shorter transcript NM_000245.4. Simply looking for information on MET p.D1228N in functional databases might thus result in wrong conclusions. To find corresponding AAs in other databases, one general approach is to read off the corresponding positions from a pairwise global alignment computed, for example, with the Needleman-Wunsch algorithm (available for example at the EBI or SIB). However, the optimal solution to a global alignment can include the alignment of distinct exons, giving the false impression that the AAs in these regions correspond to each other. To circumvent this problem, the Mirage [Nord et al., 2018] software performs a computationally expensive multiple sequence alignment of the corresponding sequences in the genome.

Here, we introduce IsoAligner, the first web service for effortless, fast, dynamic and interactive positional mapping of AA between isoform sequences. The tool applies a gene-specific minimal exon length function integrated into a Needleman-Wunsch algorithm to identify false-positive correspondences between amino acids of different isoforms. IsoAligner is simple and interactive, simultaneously supports gene/transcript/protein IDs from Ensembl (RRID:SCR_002344), UniProt (RRID:SCR_002380), RefSeq (RRID:SCR_003496), HGNC (RRID:SCR_002827) and UCSC (RRID:SCR_011624), and returns a ready-to-use mapping table between AA positions of protein isoforms. The source code for IsoAligner is available from GitHub and is archived with Zenodo (Hanimann & Kahraman, 2022).

Methods

Alignment approach

The challenge of aligning protein isoforms can be described as matching identical exons. The IsoAligner algorithm exploits this elementary characteristic of isoforms and applies custom parameters to the established Needleman-Wunsch algorithm, followed by an evaluation of all subalignment lengths. Subalignments that do not meet the gene-specific minimal exon length are discarded and marked as false-positive correspondences. The default global alignment parameters have been selected to support island-like solutions of the alignment with a heuristically predefined open gap penalty score. Gap extensions are not penalized (match: 1, mismatch: -2, open gap: -1, gap extend: 0). However, the user can interactively change and adjust these parameters.
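The subalignment evaluation described above can be illustrated with a minimal sketch. It is not the IsoAligner source code: it assumes that the two gapped sequences of a global alignment are already available, and it treats a "subalignment" as a run of consecutive identical residues, which is our reading of the text. The function name `map_positions` is hypothetical.

```python
def map_positions(aligned_ref, aligned_alt, min_exon_len=12):
    """Map 1-based positions between two gapped, equally long aligned sequences.

    Assumption: a subalignment is a run of consecutive identical residues.
    Runs shorter than min_exon_len are treated as false-positive
    correspondences and left out of the mapping.
    """
    if len(aligned_ref) != len(aligned_alt):
        raise ValueError("aligned sequences must have the same length")

    mapping = []          # accepted (ref_pos, alt_pos) pairs
    run = []              # current run of consecutive matches
    ref_pos = alt_pos = 0

    for ref_aa, alt_aa in zip(aligned_ref, aligned_alt):
        # Advance the ungapped position counters.
        if ref_aa != "-":
            ref_pos += 1
        if alt_aa != "-":
            alt_pos += 1

        if ref_aa == alt_aa and ref_aa != "-":
            run.append((ref_pos, alt_pos))
        else:
            # A gap or mismatch ends the run; keep it only if long enough.
            if len(run) >= min_exon_len:
                mapping.extend(run)
            run = []

    if len(run) >= min_exon_len:
        mapping.extend(run)
    return mapping
```

With the toy alignment `MKTAYIAK--GS` against `MKTAY---QQGS` and `min_exon_len=3`, only the first five residues are mapped; the trailing two-residue match `GS` is shorter than the threshold and is discarded.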

Implementation

The IsoAligner software is written in Python v3.8 (RRID:SCR_008394). The alignment algorithm is based on the align.globalms function of the pairwise2 module from the Bio package (RRID:SCR_007173) and the website is built with Streamlit. A REST API for programmatic access runs on the Flask framework.

Human Isoform Library

The Human Isoform Library forms the core of IsoAligner. The library is a comprehensive reference database comprising over 1.3 million gene/transcript/protein IDs mapped on 120k protein sequences from multiple sequence reference databases, namely Ensembl [Howe et al., 2021], UniProt [The Uniprot Consortium, 2021], RefSeq [O’Leary et al., 2016], HGNC [Tweedie et al., 2021] and UCSC [Smedley et al., 2015]. The different databases were integrated by pairing IDs of protein sequences to each other using the Biomart (RRID:SCR_002987) mapping tool and by comparing raw amino acid sequences. The individual minimal exon length was extracted from Ensembl’s GTF file (v104) and was required to be at least three AA. For custom sequences provided by the user, we set the minimum exon length to 12 AA, corresponding to the median length of the shortest exon per gene. Using our adapted Needleman-Wunsch alignment approach, we were able to map the whole human isoform library, with 106k alignments and a total of 40.5 million perfect AA matches (dataframe available online). However, we also identified 862,136 false-positively aligned amino acid positions that could have resulted in a wrong amino acid position in an alternative isoform. Ultimately, our human isoform library provides a clean positional mapping table of corresponding AAs for all alternative protein isoforms.
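How a gene-specific minimal exon length might be derived from a GTF file can be sketched as follows. The extraction script itself is not described here, so every detail is an assumption: the sketch uses `CDS` features, approximates the length in AA as the nucleotide length integer-divided by three, identifies genes by the `gene_id` attribute, and applies the lower bound of three AA. The function name `min_exon_lengths` is hypothetical.

```python
import re

# Matches the gene_id attribute in the ninth GTF column.
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def min_exon_lengths(gtf_lines, floor=3):
    """Return {gene_id: shortest coding exon length in AA}, never below `floor`.

    Assumptions (not taken from the IsoAligner code): only CDS features are
    counted, and length in AA is (end - start + 1) // 3.
    """
    shortest = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        match = GENE_ID.search(fields[8])
        if match is None:
            continue
        # GTF coordinates are 1-based and inclusive.
        length_aa = (int(fields[4]) - int(fields[3]) + 1) // 3
        gene = match.group(1)
        shortest[gene] = min(shortest.get(gene, length_aa), length_aa)
    return {gene: max(length, floor) for gene, length in shortest.items()}
```

The function accepts any iterable of lines, so an open file handle on an uncompressed GTF file can be passed directly.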

Operation

Since the front-end of IsoAligner is built with Streamlit, it is compatible with the following browsers:

Google Chrome (version 98 or newer)
Firefox (version 97 or newer)
Microsoft Edge (version 98 or newer)
Safari (version 14 or newer)

Website. The input text field on www.isoaligner.org accepts various gene and protein IDs as well as custom sequence pairs (see Figure 1). Gene and protein IDs from different databases can be mixed and searched simultaneously. The workflow is as follows:

Figure 1. Overview of IsoAligner: the project structure from back-end generation of the human isoform library to the front-end user interaction.

Quick Start: Click on 'Show Example' and then 'Search and Align' to get an overview.

Enter either one isoform ID per gene, two isoform IDs per gene, a list of gene names, or two raw amino acid sequences. The input can be tab, comma or whitespace separated. Click 'Search and Align' or 'Align' to compute alignments.
Information on the chosen reference sequence and its alignments against all other isoforms can be displayed by using the corresponding drop-down buttons.
Further down on the page, the computed mapping table is shown. On the left sidebar of the application, the function parameters of the Needleman-Wunsch algorithm and the minimal exon length function are displayed. Changing the values of the parameters instantly updates the alignment visualisation and mapping table.
The mapping table can be filtered by entering a value in the 'Filter table for exact value' input field and pressing enter.
The entire mapping table can additionally be complemented with associated isoform IDs and downloaded as a csv or tsv file.

Further information on how to use IsoAligner can be found in the "Manual & About" section on the left sidebar of IsoAligner’s website.

REST API. The REST API is built using Flask v1.1 and is accessible through the URL www.isoaligner.org/api. Currently, a GET method for isoform IDs called "map" is available for retrieving mapping tables between corresponding amino acid positions, as well as the method "align" for retrieving the alignment of two raw protein sequences.

The resource "map" gives access to the human isoform library, computes alignments with the specified parameters and retrieves whole mapping tables in json format. The only required parameter is id1, which provides the isoform ID of interest. The available parameters are:

Parameter: id1

ID of any type (Ensembl, RefSeq, UniProt, UCSC) to access the isoforms of a gene in the human isoform library. To define the reference protein sequence against which all other splice variants will be aligned, a specific isoform identifier should be used. Otherwise, the longest isoform is automatically chosen as the reference canonical sequence.
Request example: www.isoaligner.org/api/map?id1=EGFR-201
Response: Entirety of a mapping table in json format for EGFR with EGFR-201 defined as the reference sequence aligned against all other isoforms of the human isoform library.

Parameter: id2

Specific isoform ID (Ensembl, RefSeq, UniProt, UCSC) to use as the alternative splice variant to align with the reference sequence of id1.
Request example: www.isoaligner.org/api/map?id1=EGFR-201&id2=EGFR-207
Response: mapping table in json format for EGFR-201 aligned against EGFR-207.

Parameter: pos

If both id1 and id2 are set in the request, a single corresponding AA position on the alternative isoform sequence can be retrieved.
Request example: www.isoaligner.org/api/map?id1=EGFR-201&id2=EGFR-207&pos=1038
Response: 993

Parameter: min_ex_len

The alignment parameter for the minimal exon length (consecutive AAs) is gene-specific per default and can be manually defined as follows:
Request example: www.isoaligner.org/api/map?id1=EGFR-201&id2=EGFR-207&min_ex_len=23

Parameter: df_ids

Sequence database IDs to be included in the mapping table. Per default, the mapping table consists of the same type of IDs sent with the request. Available options are: [ensembl, refseq, uniprot, ucsc, hgnc].
Request example: www.isoaligner.org/api/map?id1=EGFR-201&id2=EGFR-207&df_ids=[ensembl,uniprot]

Parameter: match

Needleman-Wunsch alignment parameter to reward matches. This value must be ≥0.

Parameter: mismatch

Needleman-Wunsch alignment parameter to penalize mismatches. This value must be ≤0.

Parameter: open_gap

Needleman-Wunsch alignment parameter to penalize opening a gap. This value must be ≤0.

Parameter: gap_open

Needleman-Wunsch alignment parameter to penalize extending a gap. This value must be ≤0.
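The request examples above can be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper name `build_map_url` is hypothetical, the `https://` scheme is taken from the software availability links, and the checks simply mirror the constraints stated for each parameter (pos needs both IDs, match ≥0, penalties ≤0). No request is sent.

```python
from urllib.parse import urlencode

MAP_URL = "https://www.isoaligner.org/api/map"

# Required sign per alignment parameter: +1 means >= 0, -1 means <= 0.
SIGN_RULES = {"match": 1, "mismatch": -1, "open_gap": -1, "gap_open": -1}


def build_map_url(id1, id2=None, pos=None, **alignment_params):
    """Build a request URL for the "map" resource (hypothetical helper)."""
    params = {"id1": id1}
    if id2 is not None:
        params["id2"] = id2
    if pos is not None:
        if id2 is None:
            raise ValueError("pos requires both id1 and id2")
        params["pos"] = pos

    # Reject values that violate the documented sign constraints.
    for name, sign in SIGN_RULES.items():
        value = alignment_params.get(name)
        if value is not None and value * sign < 0:
            raise ValueError(f"{name} has the wrong sign: {value}")

    params.update(alignment_params)
    return f"{MAP_URL}?{urlencode(params)}"
```

For example, `build_map_url("EGFR-201", "EGFR-207", pos=1038)` reproduces the pos request shown above, and `min_ex_len=23` can be passed as a keyword argument.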

With the resource "align", one can align two raw amino acid sequences sent with the request and retrieve a mapping table in json format. The required parameters are seq1 and seq2. All alignment parameters (min_ex_len, match, mismatch, open_gap, gap_open) are also applicable to this resource.

Parameters: seq1 and seq2

Reference and alternative raw amino acid sequences. Each must be at least 7 AA long, for example:
Request: www.isoaligner.org/api/align?seq1=CRSSWTAAMELSAEYLREKLQRDLEAEHVE&seq2=YLREKLQRDLEAEHVEVEDTTLNRCSCSFRVLVVSAKFEGKPLLQRH
Response: mapping table in the json format.
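A client for the "align" resource could look like the sketch below. It uses only the Python standard library; the helper names are hypothetical and the structure of the returned mapping table is not assumed. Because an API may deliver its payload as a JSON-encoded string rather than a JSON object, the sketch decodes a second time when the first pass yields a string; whether this is needed for IsoAligner is an assumption.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ALIGN_URL = "https://www.isoaligner.org/api/align"
MIN_SEQ_LEN = 7  # minimum sequence length stated for seq1 and seq2


def build_align_url(seq1, seq2, **alignment_params):
    """Build a request URL for the "align" resource (hypothetical helper)."""
    for name, seq in (("seq1", seq1), ("seq2", seq2)):
        if len(seq) < MIN_SEQ_LEN:
            raise ValueError(f"{name} must be at least {MIN_SEQ_LEN} AA long")
    params = {"seq1": seq1, "seq2": seq2}
    params.update(alignment_params)
    return f"{ALIGN_URL}?{urlencode(params)}"


def decode_mapping(body):
    """Parse a JSON body, decoding twice if the payload is a JSON string."""
    data = json.loads(body)
    if isinstance(data, str):
        data = json.loads(data)
    return data


def fetch_alignment(seq1, seq2, **alignment_params):
    """Send the request and return the decoded mapping table."""
    url = build_align_url(seq1, seq2, **alignment_params)
    with urlopen(url, timeout=30) as response:
        return decode_mapping(response.read().decode("utf-8"))
```

Only `fetch_alignment` touches the network; the two helpers can be checked offline.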

Use cases

IsoAligner helps to transfer annotated mutational data from one transcript to another when working with different isoform database IDs. For example, the MET p.D1246N and p.D1228N resistance mutations annotated on different transcripts, as discussed in the introduction, can easily be matched with IsoAligner. First, the user needs to paste one of the transcript IDs, for example the RefSeq ID NM_000245.4 of the p.D1228N annotation, into the "Input" text field at the top of the website (see Figure 2).

Figure 2. Example input to investigate the position of MET D1228N on NM_000245.4 across isoforms.

Clicking the "Search and Align" button runs the IsoAligner algorithm and returns simple statistics on the transcript. In this case, there are 8 human isoform entries for the MET gene in the IsoAligner database. The RefSeq ID given as the input (NM_000245.4) is automatically mapped to the Ensembl transcript name (MET-202). Additional information such as the isoform sequence, various gene attributes and isoform IDs can be found by clicking the "View details about this Isoform Entry" drop-down menu (see Figure 3).

Figure 3. Various information related to the sequence, gene attributes and isoform IDs can be found in IsoAligner.

Pairwise sequence alignments between all transcripts and the query transcript can be found by clicking on the drop-down menu "View Alignment Visualisations" (see Figure 4). The alignments update immediately whenever any alignment parameter on the left-hand sidebar is adjusted.

Figure 4. Pairwise alignment between the query transcript NM_000245.4 and the canonical transcript NM_001127500.3, highlighting the corresponding amino acid positions D1228N and D1246N.

Figure 5. Mapping table listing corresponding amino acid positions between all MET isoforms in the IsoAligner database.

Further down, the user can find the "Mapped Amino Acid Positions" table that lists the corresponding amino acid positions between all MET isoforms in the IsoAligner database and the query isoform NM_000245.4. Typing the amino acid position of the resistance mutation, 1228, into the "Filter table for exact value" text field and pressing enter shows all amino acid positions mapped to position 1228 (see Figure 5). The first listed position is the corresponding amino acid in the canonical transcript MET-201 with the RefSeq ID NM_001127500.3. The last column provides the information that the corresponding amino acid position of 1228 in the canonical transcript is 1246. Note that a corresponding amino acid in a shorter isoform with the RefSeq ID NM_001324402.2 is also shown, mapping position 1228 to 798, while the third hit corresponds to a map between position 1210 in the query transcript and 1228 in the canonical transcript.
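The three hits can be reproduced with a small sketch of an exact-value filter. The rows are the positions quoted in the paragraph above; the field names and the assumption that the filter matches either position column are ours (the latter is inferred from the third hit, where 1228 is the canonical rather than the query position).

```python
from typing import NamedTuple


class MappingRow(NamedTuple):
    """One row of a mapping table (field names are hypothetical)."""
    alt_isoform: str   # RefSeq ID of the isoform aligned to the query
    query_pos: int     # position on the query isoform NM_000245.4
    alt_pos: int       # corresponding position on the other isoform


# Positions quoted in the use case for the query isoform NM_000245.4.
ROWS = [
    MappingRow("NM_001127500.3", 1228, 1246),
    MappingRow("NM_001324402.2", 1228, 798),
    MappingRow("NM_001127500.3", 1210, 1228),
]


def filter_exact(rows, value):
    """Keep rows in which either position column equals `value`."""
    return [row for row in rows if value in (row.query_pos, row.alt_pos)]
```

Filtering for 1228 returns all three rows, whereas filtering for 1246 returns only the first. Both rows against NM_001127500.3 show the same offset of 18 positions, matching the 18 AA length difference between the two transcripts.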

A ”Download Table” button is available to retrieve the entirety of the table in tsv or csv format. Consider clicking the checkbox ”Select all columns” to get additional database IDs for the transcripts. Alternatively, users might send an API request https://www.isoaligner.org/api/map?id1=NM_000245.4 to the REST API and retrieve a json file representation of the mapping table.

The data used in this use case can be found as Underlying data (Hanimann & Kahraman, 2022).

Conclusion

IsoAligner is a fast and interactive protein isoform alignment web service that uses a customised Needleman-Wunsch algorithm to specifically align alternatively spliced protein isoforms. The comprehensive library of IsoAligner comprises 1.3 million IDs and 120k protein sequences of 19k human genes, which allows rapid positional mapping of amino acids across isoform IDs from Ensembl, RefSeq, UCSC and UniProt.

Data availability

Zenodo: IsoAligner: dynamic mapping of amino acid positions across protein isoforms. https://doi.org/10.5281/zenodo.6354488 (Hanimann & Kahraman, 2022).

This project contains the following underlying data:

Human Isoform Library Data (the datasets used to generate the Human Isoform Library)
human isoform library v1.tsv.gz (the pre-computed mapped human isoform library)
Example Manuscript (the input/output files from the example Use Case section).

Software availability

Webtool available at: https://www.isoaligner.org/

REST API available at: https://www.isoaligner.org/api

Source code available from: https://github.com/mtp-usz/IsoAligner

Archived source code at time of publication: https://doi.org/10.5281/zenodo.6354488 (Hanimann & Kahraman, 2022)

License: CC0-1.0

Acknowledgements

We thank the members of the Clinical Computational Biology group at the University Hospital Zurich for their constant support and valuable inputs.


References

Hanimann J, Kahraman A: IsoAligner: dynamic mapping of amino acid positions across protein isoforms (IsoAligner v1.2.0). Zenodo. 2022. http://www.doi.org/10.5281/zenodo.6354488
Howe KL, Achuthan P, Allen J, et al.: Ensembl 2021. Nucleic Acids Res. 2021; 49(D1): D884–D891.
Nord A, Carey K, Hornbeck P, et al.: Splice-Aware Multiple Sequence Alignment of Protein Isoforms. ACM BCB. 2018; 2018: 200–210.
O’Leary NA, Wright MW, Brister JR, et al.: Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016; 44(D1): D733–45.
Smedley D, Haider S, Durinck S, et al.: The BioMart community portal: an innovative alternative to large, centralized data repositories. Nucleic Acids Res. 2015; 43(W1): W589–W598.
Stephenson JD, Laskowski RA, Nightingale A, et al.: VarMap: a web tool for mapping genomic coordinates to protein sequence and structure and retrieving protein structural annotations. Bioinformatics. 2019; 35(22): 4854–4856.
The Uniprot Consortium: UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021; 49(D1): D480–D489.
Tsai JM, Hata AN, Lennerz JK: MET D1228N and D1246N are the Same Resistance Mutation in MET Exon 14 Skipping. Oncologist. 2021; 26(12): e2297–e2301.
Tweedie S, Braschi B, Gray K, et al.: Genenames.org: the HGNC and VGNC resources in 2021. Nucleic Acids Res. 2021; 49(D1): D939–D946.


Open Peer Review


Reviewer Report 17 Aug 2023

Laurens van de Wiel, Stanford University, Stanford, California, USA

Approved with Reservations

https://doi.org/10.5256/f1000research.80113.r185231

Hanimann et al. present IsoAligner as a fast and dynamic web server which intuitively incorporates the Needleman-Wunsch algorithm to align all protein isoforms for a protein-coding gene of interest. I applaud the authors for the aesthetically pleasing web server they have made and compliment them on how intuitively the extensive functionality is presented. The authors argue that one of the novelties of IsoAligner is that alignments are made from the perspective of translated exon regions instead of the entire isoform amino-acid sequence composition. They kindly acknowledge that the computationally more expensive Mirage alignment software also operates from an exon-specific perspective. IsoAligner as a software seems very useful for many bioinformatic teams in the field of (human) genetics for variant interpretations. Especially from a variant data-integration point of view I can see that IsoAligner could lead to more accurate condensations of datasets.
I have the following remarks that I would like to see addressed:

The authors argue against using isoform-based alignments, as this often leads to alignments on non-identical exon regions. For this, the authors provide a single, but convincing, case on how this can lead to the wrong conclusions when referring to a protein-based variant without correctly adding the specific isoform. In terms of quantification, I can only find the number of falsely positive aligned amino acid positions that relate to this in the “Methods” section under “Human Isoform Library”. There is information lacking in this quantification:
1. The number of amino acids does not mean much outside of the context of genes and/or identical protein sequences between genes. How many of the 120k protein sequences are the exact same protein-product (suggesting to compare to UniProtKB) from their respective genes? From my own previous analysis in (DOI: 10.1002/humu.23798) which performed a similar analysis on only 2 sets: UniProtKB-SwissProt protein sequences with Ensembl GENCODE (hg19/v13) annotation sets: “there is an identical match of 79.4% of the Human Swiss-Prot protein sequences to one or more of 42,116 GENCODE transcripts. This means that 25.7% of the GENCODE transcriptions differ in messenger RNA (mRNA) but translate to the same Swiss-Prot protein sequence.”
2. The human genome and its protein products should not be taken out of context of genome build (e.g. GRCh37/hg19, GRCh38/hg38) and the corresponding genome annotation version (RefSeq, Ensembl/GENCODE, or combined as MANE). Adding this information would benefit the segmentation and transparency of the data in the manuscript and without it the web server would be less useful for the genomics community.
3. It would further aid the manuscript’s argument on exon vs isoform based alignment if the authors could expand by adding a table with a per-dataset comparison (annotation vs protein database) and a combined dataset comparison would help to compare if the exon-based alignment (e.g. there is no quantification on how many genes / protein sequences are extracted per dataset)
4. With these suggestions taken into account, some minor scientific oriented questions could be answered in the manuscript, such as: do falsely positively aligned amino acids differ between genome builds (hg19 vs hg38) or annotation source (e.g. RefSeq vs Ensembl)? Or the like.
Comparison with Mirage: IsoAligner is claimed to be faster than Mirage. From my one-sided usage of IsoAligner, and without previously having used Mirage, I can state that IsoAligner is indeed fast and responsive, but it would be beneficial to the argument of the manuscript to see some comparison with Mirage that shows the actual increase in speed and/or reduction of mathematical complexity, and especially whether this increase in speed sacrifices anything in terms of accuracy of output. Comparing a few challenging alignment cases would provide insights into differences, and a controlled environment speed test would further bolster this comparison.
I have been able to cause a server-side crash lasting several hours twice by entering “TTN” (25 transcripts, varying in aa length from 48aa to 35,991aa) as a stress-test. I would like to see if the authors are able to indeed analyse TTN and/or find a work around to this and similar cases. To ascertain if the error was client-side, I tried accessing IsoAligner on multiple devices and browsers. However, I confirmed that neither other websites nor devices could access the web server. In all cases the web server displayed the error message “Oh no. Error running app. If this keeps happening, please contact support”. Although preferable to an unresponsive server, I have several suggestions for future improvements, which I've detailed under minor suggestions below.
Multiple potential trackers/js are attached to the session (segment, googletagmanager), please add a privacy policy on what is tracked and for what purpose, see: https://ourworldindata.org/privacy-policy

Minor comments and suggestions:

Introduction:
- Consider specifying or removing the term "functional" from "functional databases". The current phrasing is ambiguous, making it unclear if you're referring to operational databases or databases containing functional variations. If it's the latter, note that variations in gnomAD and ClinVar aren't inherently functional.
- The variants in gnomAD and ClinVar are transcript-specific (I am unfamiliar with COSMIC), indicating every transcript based on user selection. The distinction between exon-based vs. isoform-based alignment might not be problematic unless authors provide examples for clarity.
Methods:
- The "alignment approach" section would be clearer with a step-by-step breakdown or even an illustrative figure. The figure showcasing two isoforms on the web server's "about" page could be integrated here, as it elucidates the challenge the authors address, even for non-experts.
- Results, such as "... with 106k alignments and a total of 40.5 million perfect AA matches", should be discussed in the "Results" section and not in "Methods".
General Observations
- For clarity, the first mention of large numbers should be spelled out in full. Subsequent mentions can be abbreviated, but clarity is paramount. For approximations, consider using symbols like "~" or terms like "about", "approximately", or "roughly" to avoid ambiguity.
Web server improvements:
- API Output format: The API example at https://www.isoaligner.org/api/map?id1=EGFR-201 returns JSON strings with escaped characters (e.g., \"). This typically results from using double quotes for strings. While not a major concern, it necessitates additional processing for the output.
- API Error Handling: If commands that don't exist are submitted, ensure the system provides descriptive error messages. For instance, using dfids instead of df_ids didn't yield an informative error.
- The page isoaligner.org/api instructs users to refer to the left sidebar on the main page for details on sending requests. It would be more user-friendly if the page directly detailed the necessary steps, especially if the main page became unavailable.
- It may be beneficial to implement explicit URL routes for direct navigation to specific isoforms, e.g., “isoaligner.org/ENSEMBL//”. This would simplify error tracking, especially if a recorded link caused server downtime.
GitHub:
- Kudos to the authors for open-sourcing the code. It would further aid users if a detailed configuration guide is provided on their GitHub, allowing for deployment of IsoAligner in custom environments.
- User-specific configuration files are present in the .idea and .streamlit directories, which is considered bad practice.
- The requirements.txt file merely lists "Bio" without specifying a version. This can be problematic as it defaults to downloading the latest version, potentially introducing instability in future iterations of the software.

Is the rationale for developing the new software tool clearly explained? Partly

Is the description of the software tool technically sound? Yes

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Partly

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? No

References

1. Wiel L, Baakman C, Gilissen D, Veltman JA, et al.: MetaDome: Pathogenicity analysis of genetic variants through aggregation of homologous human protein domains. Hum Mutat. 2019; 40(8): 1030–1038.

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Bioinformatics, protein sequence and structure, genomics, comparative genomics, evolutionary genetics, software development

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Reviewer Report 05 Jul 2023

James D Stephenson, EMBL-EBI, Hinxton, UK

Approved with Reservations

https://doi.org/10.5256/f1000research.80113.r179554

An isoform alignment resource is a useful addition to the bioinformatics field. Whilst it is possible to extract equivalent positions across isoforms using other resources, IsoAligner stands out for its speed, ease of use, completeness and accessibility.

The manuscript clearly explains the methods used and the breadth of resources used for the data which is a critical aspect of the resource. In addition to using all of the critical data sources for isoforms, IsoAligner also allows users to query using the various IDs used by these resources which makes retrieving data simple and intuitive. Multiple ways of searching make it easy for most users to use the tool with the data at hand rather than having to format or refactor it first. The availability of a bulk download is also a useful feature, as is the programmatic access via an API.

The code is deposited in GitHub with sufficient documentation which would allow replication of the software development by others and the methodology is clearly explained in the manuscript.

The example in the manuscript is useful for interpreting the results generated by the tool. There is also sufficient help text in the user interface to help users understand the outputs. The fact that the example on the homepage is randomly generated rather than cherry-picked is a good demonstration of the completeness of the tool and of the developers' confidence that it is robust. The performance of the user interface is also impressive, especially as it recalculates alignments dynamically as the parameters are adjusted.

The conclusions stated are supported by the data and the resource, apart from one issue, detailed below, regarding the number of proteins covered.

Minor points to address:

- The home page states that “The current human isoform library consists of ~19'000 protein coding genes”. However, the bulk download stats show that there are 16432. The paper's conclusion also states “IsoAligner comprises 1.3 million IDs and 120k protein sequences of 19k human genes”. It would be useful to explain this difference in the paper and ensure that the statement on the home page is not misleading.
- As a very minor additional observation, the gene number is written as “19’000” on the website, but a comma is used in the manuscript, e.g. “862,136 false-positively aligned amino acids”. This does not necessarily need to be changed but should ideally be consistent.
- It would be useful for the website and the manuscript to mention the maintenance plan and update cycle for IsoAligner. When was the data last updated, and when will it be refreshed? The paper mentions that the data was updated at the time of initial publication, but that was, I believe, early 2022. From the GitHub repository it looks like the 'human isoform library' data has not been updated for two years.
- There is a drop-down menu for species, but only human appears in the list. Does this list auto-populate based on the input, or does it exist because further species are planned in the future? If it is the latter, might I suggest that it be removed until other species are available.
- The table download feature is useful, especially as the user can download only the selected columns. However, I don’t think this is clear enough, so I would recommend adding some brief text to the download table button, such as “download table with currently selected columns”.

Is the rationale for developing the new software tool clearly explained?

Yes
Is the description of the software tool technically sound?

Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Genomic variation analysis in protein coding regions.




Alongside their report, reviewers assign a status to the article:

Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested

Approved with reservations - a number of small changes, and sometimes more significant revisions, are required to address specific details and improve the paper's academic merit

Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
