MSNGO: Multi-species protein function annotation based on 3D protein structure and network propagation
This is the code repository for the protein function prediction model MSNGO.
MSNGO is a multi-species protein function prediction model based on structural features and heterogeneous network propagation. It provides a structure encoder and propagates structural features over a heterogeneous network to predict Gene Ontology terms.
- The code was developed and tested using Python 3.8.
- To install Python dependencies, run:
pip install -r requirements.txt
Some libraries may need to be installed via conda.
- The version of CUDA is:
cudatoolkit==11.3.1
The data used are:
- Sequence: download from the UniProt website.
- PPI Network: download from the STRING website.
- Annotation: download from the GOA website.
- Gene Ontology: download from the GO website.
- AlphaFold structure: download from the AlphaFold website.
We also provide a small dataset with fewer than 50 proteins for quickly testing the model. It can be found here.
For a detailed description of data files, please see here.
Read here to get a quick start. If you want to train on your own dataset, please download esm2_t33_650M_UR50D.pt to MSNGO/esm2_t33_650M_UR50D/
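Before training on your own dataset, it can help to confirm that the ESM-2 checkpoint is where the code expects it. The following is a minimal sketch, not part of the repository: the helper names `esm2_checkpoint_path` and `has_esm2_checkpoint` are hypothetical, and it assumes the path MSNGO/esm2_t33_650M_UR50D/ is interpreted relative to the directory you pass in.

```python
from pathlib import Path

# Directory and file name taken from the instruction above.
# Assumption: the path is relative to the directory passed as `base`.
ESM2_DIR = Path("MSNGO") / "esm2_t33_650M_UR50D"
ESM2_FILE = "esm2_t33_650M_UR50D.pt"


def esm2_checkpoint_path(base="."):
    """Return the expected location of the ESM-2 checkpoint under `base`."""
    return Path(base) / ESM2_DIR / ESM2_FILE


def has_esm2_checkpoint(base="."):
    """Return True if the checkpoint file exists and is not empty."""
    path = esm2_checkpoint_path(base)
    return path.is_file() and path.stat().st_size > 0


if __name__ == "__main__":
    if has_esm2_checkpoint():
        print("Found", esm2_checkpoint_path())
    else:
        print("Missing", esm2_checkpoint_path())
```

This only checks that a non-empty file is present; it does not verify that the download is complete or uncorrupted.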
The preprocessing.sh script processes your raw data. Run it with the following command:
./scripts/preprocessing.sh
The mf, bp, and cc branches are each trained, predicted, and evaluated by the following scripts, respectively:
./scripts/run_mf.sh
./scripts/run_bp.sh
./scripts/run_cc.sh
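If you want to run several branches back to back, a small driver can call the three scripts in order. This is a sketch under assumptions rather than part of the repository: the function `run_branches` is hypothetical, it assumes the scripts are bash scripts, and it assumes you run it from the repository root.

```python
import subprocess

# Script paths as listed above.
BRANCH_SCRIPTS = {
    "mf": "./scripts/run_mf.sh",
    "bp": "./scripts/run_bp.sh",
    "cc": "./scripts/run_cc.sh",
}


def run_branches(branches=("mf", "bp", "cc"), dry_run=False):
    """Run the script for each requested branch in order.

    Returns the list of commands. With dry_run=True nothing is executed.
    """
    commands = []
    for branch in branches:
        if branch not in BRANCH_SCRIPTS:
            raise ValueError(f"unknown branch: {branch!r}")
        # Assumption: the scripts are bash scripts.
        commands.append(["bash", BRANCH_SCRIPTS[branch]])
    if not dry_run:
        for command in commands:
            # check=True stops at the first script that exits with an error.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    for command in run_branches(dry_run=True):
        print(" ".join(command))
```

Validating all branch names before running anything avoids starting a long training job only to fail on a typo in a later branch.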
Our trained model can be downloaded from here.
You can use the model directly to get predictions. Run the predict.py script to make predictions for an input file (e.g., for MFO):
python predict.py --ontology mf -f your_test.fasta
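A malformed FASTA file is a common cause of failed prediction runs, so a quick check of the input can save time. The sketch below is illustrative only: `read_fasta_ids` and `predict_command` are hypothetical helpers, not part of MSNGO. The command it builds mirrors the one shown above; passing `bp` or `cc` to `--ontology` is an assumption based on the branch names, since only `mf` is shown in this document.

```python
def read_fasta_ids(path):
    """Return the record IDs in a FASTA file, raising ValueError if malformed."""
    ids = []
    has_sequence = True  # no record is open yet, so nothing is incomplete
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue
            if line.startswith(">"):
                if not has_sequence:
                    raise ValueError(f"record {ids[-1]!r} has no sequence")
                fields = line[1:].split()
                if not fields:
                    raise ValueError("empty FASTA header")
                ids.append(fields[0])  # ID is the first token of the header
                has_sequence = False
            else:
                if not ids:
                    raise ValueError("sequence data before first header")
                has_sequence = True
    if not has_sequence:
        raise ValueError(f"record {ids[-1]!r} has no sequence")
    return ids


def predict_command(ontology, fasta_path):
    """Build the predict.py command line in the form shown above."""
    return ["python", "predict.py", "--ontology", ontology, "-f", fasta_path]


if __name__ == "__main__":
    print(" ".join(predict_command("mf", "your_test.fasta")))
```

The returned command list can be passed to `subprocess.run` once the FASTA file passes the check.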