Multilingual model improves zero-shot prediction of disease effects on proteins

Background

Models for mutation effect prediction in coding sequences rely on sequence-, structure-, or homology-based features. Here, we introduce a novel method that combines a codon language model with a protein language model, providing a dual representation of disease effects. By capturing contextual dependencies at both the genetic and protein level, our approach achieves a 3% increase in ROC-AUC classifying disease effects for 137,350 ClinVar missense variants across 13,791 genes, outperforming two single-sequence-based language models. Obviously the codon language model can uniquely differentiate synonymous from nonsense mutations at the genomic level. Our strategy uses information at complementary biological scales (akin to human multilingual models) to enable protein fitness landscape modeling and evolutionary studies, with potential applications in precision medicine, protein engineering, and genomics.

Notebooks/Scripts

cal_codon_logits.py: code for computing logits from CaLM
cal_codon_scores.py: code for computing effect scores at codon-level
cal_residue_logits.py: code for computing logits from ESM-2
cal_residue_scores.py: code for computing effect scores at residue-level
codon_preference.ipynb: code for identifying codon preferences
get_fasta.py: code for downloading cDNA and protein sequences from NCBI

Results

missense: contain all the prediction results for missense variants
nonsense: contain all the prediction results for nonsense variants
synonymous: contain all the prediction results for synonymous variants
codon preference.csv: codon preference results for each model

Datasets

All single-point mutations were downloaded from ClinVar (https://ftp.ncbi.nlm.nih.gov/pub/clinvar). The final benchmark dataset comprised 137,350 missense variants across 13,791 genes, 46,386 nonsense variants across 3,291 genes, and 527,579 synonymous variants across 11,593 genes.

Installation

To replicate our experiments, following the ESM-2 and CaLM first:

1. Install CaLM

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu 
pip3 install einops
pip3 install rotary_embedding_torch
pip3 install biopython
python setup.py install

2. Install ESM-2

pip install fair-esm  # latest release, OR:
pip install git+https://github.com/facebookresearch/esm.git  # bleeding edge, current repo main branch

References

Meier, J., Rao, R., Verkuil, R., Liu, J., Sercu, T., Rives, A.: Language models enable zero-shot prediction of the effects of mutations on protein function. Advances in neural information processing systems 34, 29287–29303 (2021)
Outeiral, C., Deane, C.M.: Codon language embeddings provide strong signals for use in protein engineering. Nature Machine Intelligence 6(2), 170–179 (2024)
Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., et al.: Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379(6637), 1123–1130 (2023)

Name		Name	Last commit message	Last commit date
Latest commit History 73 Commits
Figure		Figure
Results		Results
bin		bin
data		data
plot		plot
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Multilingual model improves zero-shot prediction of disease effects on proteins

Background

Contents

Notebooks/Scripts

Results

Datasets

Installation

1. Install CaLM

2. Install ESM-2

References

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

Cassie818/Viral-mut

Folders and files

Latest commit

History

Repository files navigation

Multilingual model improves zero-shot prediction of disease effects on proteins

Background

Contents

Notebooks/Scripts

Results

Datasets

Installation

1. Install CaLM

2. Install ESM-2

References

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages