BioSeq-BLM: a platform for analyzing DNA, RNA, and protein sequences based on biological language models
Software Requirements:
- Python3
- virtualenv or Anaconda
- CUDA 10.0 (optional; needed only for GPU use)
- cuDNN (>= 7.4.1) (optional; needed only for GPU use)
BioSeq-BLM has been tested on Windows and on Ubuntu 16.04 and 18.04.
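The requirements above can be checked quickly before installing. The following is a minimal sketch, not part of BioSeq-BLM: the function name `check_environment` is hypothetical, and looking for `nvcc` on the PATH is only a rough hint that a CUDA toolkit is present (it does not verify the CUDA or cuDNN version).

```python
import shutil
import sys


def check_environment():
    """Report whether the interpreter and optional GPU tooling look usable."""
    return {
        # The requirements list Python 3; the install commands use 3.7.
        "python_version": "{}.{}".format(*sys.version_info[:2]),
        "python3": sys.version_info.major == 3,
        # Assumption: nvcc on PATH suggests a CUDA toolkit is installed.
        "nvcc_on_path": shutil.which("nvcc") is not None,
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

A missing `nvcc` is not an error, since the GPU components are optional.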
Installation using virtualenv:
virtualenv -p python3.7 venv
source ./venv/bin/activate
pip install -r requirements.txt
Installation using Anaconda:
conda create -n venv python=3.7
conda activate venv
pip install -r requirements.txt
BioSeq-BLM
├───code // Python source code for the stand-alone package.
│
├───data // Example datasets.
│
├───docs // Manual for the stand-alone package.
│
├───results // Output results are written here after running the code.
│
├───scripts // Scripts that select the best algorithms automatically.
│
├───software // Optional software that is not required for installation.
│
├───LICENSE // The license.
│
├───README.md // Repository description.
│
└───requirements.txt // Dependencies required for installation.
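To confirm that a local checkout matches the layout above, a small check along these lines may help. It is a sketch only: `missing_entries` is a hypothetical helper, and the list of names is copied from the tree above. The `results` folder may legitimately be absent before the first run.

```python
from pathlib import Path

# Top-level entries taken from the directory tree above.
EXPECTED_ENTRIES = [
    "code", "data", "docs", "results", "scripts", "software",
    "LICENSE", "README.md", "requirements.txt",
]


def missing_entries(repo_root):
    """Return the expected top-level names that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    # Check the current working directory by default.
    print("Missing:", missing_entries("."))
```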
Download the datasets from BioSeq-BLM (Download), unzip them, and place them in the '/data' folder.
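The unzip step can also be scripted. The sketch below assumes the download is a single zip archive; the helper name `extract_datasets` and the archive name `datasets.zip` in the usage note are hypothetical, and the internal layout of the real archive may differ.

```python
import zipfile
from pathlib import Path


def extract_datasets(archive_path, data_dir):
    """Extract a downloaded dataset archive into the data folder."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(data_dir)
    # Return the top-level entries so the result can be inspected.
    return sorted(p.name for p in data_dir.iterdir())
```

For example, `extract_datasets("datasets.zip", "data")` run from the repository root would unpack the archive into the 'data' folder and list what it now contains.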
Enter the '/code' directory and run the following commands.
python BioSeq-BLM_Seq.py -category DNA -mode TF-IDF -words Mismatch -word_size 4 -cl Kmeans -nc 5 -dr PCA -np 64 -fs F-value -nf 128 -rdb fs -ml SVM -cost 4 -gamma -1 -sp combine -seq_file ../data/1-DHSs/dna_pos.txt ../data/1-DHSs/dna_neg.txt -label +1 -1
python BioSeq-BLM_Seq.py -category RNA -mode OHE -method RSS -cl Kmeans -nc 5 -fs MIC -nf 128 -dr TSVD -np 128 -rdb dr -ml SVM -cost 1 -gamma -4 -seq_file ../data/2-miRNA/rna_pos.txt ../data/2-miRNA/rna_neg.txt -rss_file ../data/2-miRNA/rna_with_2rd_structure.txt -label +1 -1
python BioSeq-BLM_Seq.py -category Protein -mode TM -method LSA -in_tm BOW -words Top-N-Gram -top_n 2 -com_prop 0.7 -sn L1-normalize -cl Kmeans -nc 5 -fs Tree -nf 128 -dr KernelPCA -np 128 -rdb dr -ml RF -seq_file ../data/3-DBPs/Protein_pos.txt ../data/3-DBPs/Protein_neg.txt -label +1 -1
python BioSeq-BLM_Res.py -category Protein -method BLOSUM62 -ml LSTM -epoch 10 -lr 0.01 -dropout 0.5 -batch_size 20 -fixed_len 300 -n_layer 2 -hidden_dim 64 -seq_file ../data/4-IDRs/protein_seq.txt -label_file ../data/4-IDRs/protein_label.txt
python BioSeq-BLM_Seq.py -category Protein -mode TM -method LSA -in_tm BOW -words Top-N-Gram -top_n 2 -cl Kmeans -nc 5 -fs Tree -nf 128 -dr TSVD -np 128 -rdb no -ml SVM -seq_file ../data/5-RBPs/RBP_590.txt ../data/5-RBPs/NRBP_590.txt -label +1 -1
python BioSeq-BLM_Seq.py -category DNA -mode OHE -method One-hot -ml RF -seq_file ../data/6-RSS/ph/ph_seq_pos.txt ../data/6-RSS/ph/ph_seq_neg.txt -label +1 -1 -bp 1 -metric AUC
python BioSeq-BLM_Seq.py -category DNA -mode OHE -method One-hot -ml RF -seq_file ../data/6-RSS/py/py_seq_pos.txt ../data/6-RSS/py/py_seq_neg.txt -label +1 -1 -bp 1 -metric AUC
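Because the commands above are long, it can be convenient to assemble them from a dictionary when running several variants. The following is a sketch under stated assumptions: `build_command` and `run_in_code_dir` are hypothetical helpers, not part of BioSeq-BLM, and the flags and values are copied from the one-hot / random forest example above rather than from the tool's documentation.

```python
import subprocess
import sys


def build_command(script, options):
    """Turn a mapping of flag -> value(s) into an argument list."""
    command = [sys.executable, script]
    for flag, value in options.items():
        command.append(f"-{flag}")
        # Flags such as -seq_file and -label take several values.
        values = value if isinstance(value, (list, tuple)) else [value]
        command.extend(str(v) for v in values)
    return command


def run_in_code_dir(command, code_dir="code"):
    """Run the command from the 'code' directory; raises if it fails."""
    return subprocess.run(command, cwd=code_dir, check=True)


# Options copied from the example command above.
options = {
    "category": "DNA",
    "mode": "OHE",
    "method": "One-hot",
    "ml": "RF",
    "seq_file": [
        "../data/6-RSS/ph/ph_seq_pos.txt",
        "../data/6-RSS/ph/ph_seq_neg.txt",
    ],
    "label": ["+1", "-1"],
    "bp": 1,
    "metric": "AUC",
}
command = build_command("BioSeq-BLM_Seq.py", options)

if __name__ == "__main__":
    # Print the command for inspection; call run_in_code_dir(command) to launch it.
    print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting issues with values such as `+1` and `-1`.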
The datasets used in the manuscript can be found at BioSeq-BLM (Download).
BSD-2-Clause License
Prof. Dr. Bin Liu, email: bliu@bliulab.net
Citation: Hong-Liang Li, Yi-He Pang, Bin Liu, BioSeq-BLM: a platform for analyzing DNA, RNA and protein sequences based on biological language models, Nucleic Acids Research, Volume 49, Issue 22, 16 December 2021, Page e129, https://doi.org/10.1093/nar/gkab829