BioSeq-BLM: a platform for analyzing DNA, RNA, and protein sequences based on biological language models

System Requirements

Software Requirements:

Python 3
virtualenv or Anaconda
CUDA 10.0 (optional, only needed for GPU use)
cuDNN (>= 7.4.1) (optional, only needed for GPU use)

BioSeq-BLM has been tested on Windows, Ubuntu 16.04, and Ubuntu 18.04.

Installation

virtualenv

virtualenv -p python3.7 venv

source ./venv/bin/activate

pip install -r requirements.txt

Anaconda

conda create -n venv python=3.7

conda activate venv

pip install -r requirements.txt
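After activating the environment, a quick sanity check can confirm that the interpreter is recent enough and whether a CUDA toolkit appears to be present. The sketch below is not part of BioSeq-BLM. It uses only the standard library, takes the 3.7 minimum from the installation commands above as an assumption, and treats `nvcc` on the PATH as a hint rather than proof that GPU support will work.

```python
import shutil
import sys


def check_environment(min_version=(3, 7)):
    """Report whether the interpreter and the optional GPU toolchain look usable."""
    return {
        # Compare only (major, minor) against the assumed minimum version.
        "python_ok": sys.version_info[:2] >= min_version,
        "python_version": "%d.%d" % sys.version_info[:2],
        # Finding nvcc only suggests that a CUDA toolkit is installed.
        "nvcc_found": shutil.which("nvcc") is not None,
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

A missing `nvcc` is not an error, since CUDA and cuDNN are listed as optional.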

Optional software (not required)

Usage and examples

Directory Structure Description

BioSeq-BLM
├───code // Python source code for the stand-alone package.
│
├───data // Holds the example datasets.
│
├───docs // Manual for the stand-alone package.
│
├───results // Output is written here after the code runs.
│
├───scripts // Scripts that automatically select the best algorithms.
│
├───software // Holds the optional software mentioned in the installation section.
│
├───LICENSE // The license.
│
├───README.md // Repository description.
│
└───requirements.txt // Required for installation.

Examples

Download the datasets from BioSeq-BLM (Download), unzip them, and place them in the '/data' folder.

Change to the '/code' directory and run the following commands.
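Before launching an example, it can help to confirm that the unzipped datasets landed where the commands expect them. The sketch below is not shipped with BioSeq-BLM. The file names are copied from the example commands in this README, and the helper name `missing_datasets` is hypothetical.

```python
from pathlib import Path

# Folder and file names copied from the example commands in this README.
EXAMPLE_FILES = {
    "1-DHSs": ["dna_pos.txt", "dna_neg.txt"],
    "2-miRNA": ["rna_pos.txt", "rna_neg.txt", "rna_with_2rd_structure.txt"],
    "3-DBPs": ["Protein_pos.txt", "Protein_neg.txt"],
    "4-IDRs": ["protein_seq.txt", "protein_label.txt"],
    "5-RBPs": ["RBP_590.txt", "NRBP_590.txt"],
    "6-RSS/ph": ["ph_seq_pos.txt", "ph_seq_neg.txt"],
    "6-RSS/py": ["py_seq_pos.txt", "py_seq_neg.txt"],
}


def missing_datasets(data_dir):
    """Return the example input files that are absent under data_dir."""
    data_dir = Path(data_dir)
    missing = []
    for folder, names in EXAMPLE_FILES.items():
        for name in names:
            path = data_dir / folder / name
            if not path.is_file():
                missing.append(str(path))
    return missing


if __name__ == "__main__":
    # Assumes the check is started from the 'code' directory.
    for path in missing_datasets("../data"):
        print("missing:", path)
```

An empty result means every input file named in the examples was found.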

1 Identification of DNase I hypersensitive sites

python BioSeq-BLM_Seq.py -category DNA -mode TF-IDF -words Mismatch -word_size 4 -cl Kmeans -nc 5 -dr PCA -np 64 -fs F-value -nf 128 -rdb fs -ml SVM -cost 4 -gamma -1 -sp combine -seq_file ../data/1-DHSs/dna_pos.txt ../data/1-DHSs/dna_neg.txt -label +1 -1
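Long command lines like the one above are easy to mistype. One option is to assemble the argument list in Python and pass it to `subprocess.run`. The sketch below only builds and prints the command. The helper `build_command` is hypothetical, and every flag and value is copied from example 1 rather than taken from the tool's full option list.

```python
import shlex
import sys


def build_command(script, options, seq_files, labels):
    """Turn flag/value pairs into an argument list suitable for subprocess.run."""
    cmd = [sys.executable, script]
    for flag, value in options.items():
        cmd += ["-" + flag, str(value)]
    cmd += ["-seq_file", *seq_files]
    cmd += ["-label", *labels]
    return cmd


# Flags and values copied from example 1 above.
dhs_options = {
    "category": "DNA", "mode": "TF-IDF", "words": "Mismatch", "word_size": 4,
    "cl": "Kmeans", "nc": 5, "dr": "PCA", "np": 64, "fs": "F-value", "nf": 128,
    "rdb": "fs", "ml": "SVM", "cost": 4, "gamma": -1, "sp": "combine",
}
cmd = build_command(
    "BioSeq-BLM_Seq.py",
    dhs_options,
    ["../data/1-DHSs/dna_pos.txt", "../data/1-DHSs/dna_neg.txt"],
    ["+1", "-1"],
)
# Print a shell-quoted version for inspection instead of executing it.
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` from inside the '/code' directory would execute it. Keeping the options in a dictionary makes it simple to vary one parameter, such as `cost` or `gamma`, between runs.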

2 Identification of real microRNA precursors


python BioSeq-BLM_Seq.py -category RNA -mode OHE -method RSS -cl Kmeans -nc 5 -fs MIC -nf 128 -dr TSVD -np 128 -rdb dr -ml SVM -cost 1 -gamma -4 -seq_file ../data/2-miRNA/rna_pos.txt ../data/2-miRNA/rna_neg.txt -rss_file ../data/2-miRNA/rna_with_2rd_structure.txt -label +1 -1

3 Identification of DNA binding proteins

python BioSeq-BLM_Seq.py -category Protein -mode TM -method LSA -in_tm BOW -words Top-N-Gram -top_n 2 -com_prop 0.7 -sn L1-normalize -cl Kmeans -nc 5 -fs Tree -nf 128 -dr KernelPCA -np 128 -rdb dr -ml RF -seq_file ../data/3-DBPs/Protein_pos.txt ../data/3-DBPs/Protein_neg.txt -label +1 -1

4 Identification of intrinsically disordered regions in proteins

python BioSeq-BLM_Res.py -category Protein -method BLOSUM62 -ml LSTM -epoch 10 -lr 0.01 -dropout 0.5 -batch_size 20 -fixed_len 300 -n_layer 2 -hidden_dim 64 -seq_file ../data/4-IDRs/protein_seq.txt -label_file ../data/4-IDRs/protein_label.txt
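The command above passes `-fixed_len 300`. Recurrent models trained in batches generally need inputs of equal length, and a common way to achieve this is to truncate long sequences and pad short ones. The sketch below illustrates that general idea only. Whether BioSeq-BLM pads on the right, and which padding symbol it uses, are assumptions, and the helper `fix_length` is hypothetical.

```python
def fix_length(sequence, fixed_len, pad_char="X"):
    """Cut sequence to fixed_len residues, or right-pad it with pad_char."""
    if fixed_len <= 0:
        raise ValueError("fixed_len must be positive")
    # Slicing truncates long inputs; ljust pads short ones on the right.
    return sequence[:fixed_len].ljust(fixed_len, pad_char)


print(fix_length("MKV", 5))
print(fix_length("MKVLA", 3))
```

For residue-level tasks, the per-residue labels would need the same truncation and padding so that they stay aligned with the sequence.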

5 RNA-binding protein identification

python BioSeq-BLM_Seq.py -category Protein -mode TM -method LSA -in_tm BOW -words Top-N-Gram -top_n 2 -cl Kmeans -nc 5 -fs Tree -nf 128 -dr TSVD -np 128 -rdb no -ml SVM -seq_file ../data/5-RBPs/RBP_590.txt ../data/5-RBPs/NRBP_590.txt -label +1 -1

6 RNA secondary structure prediction

python BioSeq-BLM_Seq.py -category DNA -mode OHE -method One-hot -ml RF -seq_file ../data/6-RSS/ph/ph_seq_pos.txt ../data/6-RSS/ph/ph_seq_neg.txt -label +1 -1 -bp 1 -metric AUC

python BioSeq-BLM_Seq.py -category DNA -mode OHE -method One-hot -ml RF -seq_file ../data/6-RSS/py/py_seq_pos.txt ../data/6-RSS/py/py_seq_neg.txt -label +1 -1 -bp 1 -metric AUC
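The two commands above select `-mode OHE -method One-hot`. As a generic illustration of one-hot encoding, the sketch below maps each nucleotide to a 0/1 vector. It is not BioSeq-BLM's encoder. The alphabet order and the all-zero treatment of unknown symbols are assumptions made for this example.

```python
ALPHABET = "ACGT"


def one_hot(sequence, alphabet=ALPHABET):
    """Encode each symbol as a 0/1 vector; unknown symbols become all zeros."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    rows = []
    for symbol in sequence.upper():
        row = [0] * len(alphabet)
        if symbol in index:
            row[index[symbol]] = 1
        rows.append(row)
    return rows


print(one_hot("ACGN"))
# prints [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
```

Each sequence of length n becomes an n-by-4 matrix, which a classifier such as a random forest can consume once it is flattened to a fixed-size vector.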

Datasets

The datasets used in the manuscript can be downloaded from BioSeq-BLM (Download).

License

BSD-2-Clause License

Contact

Prof. Dr. Bin Liu, email: bliu@bliulab.net

Citation

Hong-Liang Li, Yi-He Pang, Bin Liu, BioSeq-BLM: a platform for analyzing DNA, RNA and protein sequences based on biological language models, Nucleic Acids Research, Volume 49, Issue 22, 16 December 2021, Page e129, https://doi.org/10.1093/nar/gkab829
