
Zimiao1025/BioSeq-BLM

BioSeq-BLM: a platform for analyzing DNA, RNA, and protein sequences based on biological language models

System Requirements

Software Requirements:

BioSeq-BLM has been tested on Windows, Ubuntu 16.04, and Ubuntu 18.04.

Installation

virtualenv

virtualenv -p python3.7 venv

source ./venv/bin/activate

pip install -r requirements.txt

Anaconda

conda create -n venv python=3.7

conda activate venv

pip install -r requirements.txt
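Both installation routes above pin Python 3.7. As a minimal sketch (the helper name `is_expected_python` is hypothetical and not part of BioSeq-BLM), the interpreter in the activated environment can be checked before installing the requirements:

```python
import sys

# Assumption: 3.7 is the target version, taken from the install commands above.
EXPECTED = (3, 7)


def is_expected_python(version_info=None, expected=EXPECTED):
    """Return True when the major.minor version matches the expected one."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(expected)


if __name__ == "__main__":
    status = "OK" if is_expected_python() else "differs from the pinned version"
    print("Python %d.%d: %s" % (sys.version_info[0], sys.version_info[1], status))
```

Run it inside the activated `venv`; a mismatch usually means the environment was not activated or was created with a different interpreter.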

Not Necessary Softwares

Usage and examples

Directory Structure Description

BioSeq-BLM
├───code // python source code for stand-alone package.
│
├───data // Used to place example datasets.
│
├───docs // Manual about stand-alone package.
│
├───results // After running the code, the output results can be found here.
│
├───scripts // Used to place the scripts selecting the best algorithms automatically.
│
├───software // Used to place the "Not Necessary Softwares" in installation.
│
├───LICENSE // the license.
│
├───README.md // repository description.
│
└───requirements.txt // Necessary file for installation.
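The layout above can be checked programmatically before running anything. The following is only a sketch: the folder names come from the tree, while the helper `missing_dirs` is a hypothetical name, and whether every folder exists in a fresh clone is not stated here (for example, `results` may only appear after a run).

```python
from pathlib import Path

# Folder names copied from the directory tree above.
EXPECTED_DIRS = ["code", "data", "docs", "results", "scripts", "software"]


def missing_dirs(repo_root):
    """Return the expected folders that are absent under repo_root."""
    root = Path(repo_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumes the script is started from the repository root.
    absent = missing_dirs(".")
    print("Missing folders:", ", ".join(absent) if absent else "none")
```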

Examples

  1. Download the datasets from BioSeq-BLM (Download), unzip them, and put them in the '/data' folder.

  2. Enter the '/code' directory and run the following command lines.

1 Identification of DNase I hypersensitive sites

python BioSeq-BLM_Seq.py -category DNA -mode TF-IDF -words Mismatch -word_size 4 -cl Kmeans -nc 5 -dr PCA -np 64 -fs F-value -nf 128 -rdb fs -ml SVM -cost 4 -gamma -1 -sp combine -seq_file ../data/1-DHSs/dna_pos.txt ../data/1-DHSs/dna_neg.txt -label +1 -1
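The command above carries many options on one line. One way to keep such a call readable and easy to modify is to assemble it as an argument list in Python. This is a sketch, not part of BioSeq-BLM: `build_command` is a hypothetical helper, and every flag and value is copied from example 1 without interpretation.

```python
import shlex


def build_command(script, options, seq_files, labels):
    """Assemble an argument list from flag/value pairs, files, and labels."""
    cmd = ["python", script]
    for flag, value in options.items():
        cmd += ["-" + flag, str(value)]
    cmd += ["-seq_file"] + list(seq_files)
    cmd += ["-label"] + [str(label) for label in labels]
    return cmd


# Flags and values copied from example 1 above.
dhs_options = {
    "category": "DNA", "mode": "TF-IDF", "words": "Mismatch", "word_size": 4,
    "cl": "Kmeans", "nc": 5, "dr": "PCA", "np": 64, "fs": "F-value",
    "nf": 128, "rdb": "fs", "ml": "SVM", "cost": 4, "gamma": -1,
    "sp": "combine",
}
dhs_cmd = build_command(
    "BioSeq-BLM_Seq.py",
    dhs_options,
    ["../data/1-DHSs/dna_pos.txt", "../data/1-DHSs/dna_neg.txt"],
    ["+1", "-1"],
)

# Print the assembled command so it can be compared with the one-line form.
print(" ".join(shlex.quote(part) for part in dhs_cmd))
```

To execute the list rather than print it, it can be passed to `subprocess.run(dhs_cmd, cwd="code", check=True)` from the repository root, since the relative `../data` paths assume the working directory is '/code'.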

2 Identification of real microRNA precursors

python BioSeq-BLM_Seq.py -category RNA -mode OHE -method RSS -cl Kmeans -nc 5 -fs MIC -nf 128 -dr TSVD -np 128 -rdb dr -ml SVM -cost 1 -gamma -4 -seq_file ../data/2-miRNA/rna_pos.txt ../data/2-miRNA/rna_neg.txt -rss_file ../data/2-miRNA/rna_with_2rd_structure.txt -label +1 -1

3 Identification of DNA binding proteins

python BioSeq-BLM_Seq.py -category Protein -mode TM -method LSA -in_tm BOW -words Top-N-Gram -top_n 2 -com_prop 0.7 -sn L1-normalize -cl Kmeans -nc 5 -fs Tree -nf 128 -dr KernelPCA -np 128 -rdb dr -ml RF -seq_file ../data/3-DBPs/Protein_pos.txt ../data/3-DBPs/Protein_neg.txt -label +1 -1

4 Identification of intrinsically disordered regions in proteins

python BioSeq-BLM_Res.py -category Protein -method BLOSUM62 -ml LSTM -epoch 10 -lr 0.01 -dropout 0.5 -batch_size 20 -fixed_len 300 -n_layer 2 -hidden_dim 64 -seq_file ../data/4-IDRs/protein_seq.txt -label_file ../data/4-IDRs/protein_label.txt

5 RNA-binding protein identification

python BioSeq-BLM_Seq.py -category Protein -mode TM -method LSA -in_tm BOW -words Top-N-Gram -top_n 2 -cl Kmeans -nc 5 -fs Tree -nf 128 -dr TSVD -np 128 -rdb no -ml SVM -seq_file ../data/5-RBPs/RBP_590.txt ../data/5-RBPs/NRBP_590.txt -label +1 -1

6 RNA secondary structure prediction

python BioSeq-BLM_Seq.py -category DNA -mode OHE -method One-hot -ml RF -seq_file ../data/6-RSS/ph/ph_seq_pos.txt ../data/6-RSS/ph/ph_seq_neg.txt -label +1 -1 -bp 1 -metric AUC
python BioSeq-BLM_Seq.py -category DNA -mode OHE -method One-hot -ml RF -seq_file ../data/6-RSS/py/py_seq_pos.txt ../data/6-RSS/py/py_seq_neg.txt -label +1 -1 -bp 1 -metric AUC
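The two commands for example 6 differ only in the dataset subfolder (`ph` or `py`). A small sketch that generates both from one template is shown below. The function name `rss_command` is hypothetical, and launching the script through `python` is an assumption based on the other examples.

```python
def rss_command(subset):
    """Build the example 6 argument list for one subset ('ph' or 'py')."""
    prefix = "../data/6-RSS/{0}/{0}_seq".format(subset)
    return [
        "python", "BioSeq-BLM_Seq.py",
        "-category", "DNA", "-mode", "OHE", "-method", "One-hot", "-ml", "RF",
        "-seq_file", prefix + "_pos.txt", prefix + "_neg.txt",
        "-label", "+1", "-1", "-bp", "1", "-metric", "AUC",
    ]


# One argument list per subset, in the same order as the commands above.
rss_commands = [rss_command(name) for name in ("ph", "py")]

for cmd in rss_commands:
    print(" ".join(cmd))
```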

Datasets

The datasets used in the manuscript can be downloaded from BioSeq-BLM (Download).

License

BSD-2-Clause License

Contact

Prof. Dr. Bin Liu, email: bliu@bliulab.net

Citation

Hong-Liang Li, Yi-He Pang, Bin Liu, BioSeq-BLM: a platform for analyzing DNA, RNA and protein sequences based on biological language models, Nucleic Acids Research, Volume 49, Issue 22, 16 December 2021, Page e129, https://doi.org/10.1093/nar/gkab829
