Repository for PDFBench: A Benchmark for De novo Protein Design from Function

The paper can be viewed on the homepage: https://pdfbench.github.io/

1. Environment

conda create -n PDF --file requirements.txt
conda activate PDF

2. Preparation

2.1. ProTrek and EvoLlama

Clone the ProTrek and EvoLlama repositories into the src folder, then download the ProTrek-650M and EvoLlama weights following each project's guidelines.

2.2. TMscore

Download TMscore from the Zhang Group and follow its instructions. Afterwards, your directory should look something like this:

cd /path/to/TMscore
tree . -L 1
├── TMscore # TMscore executable, used later
└── TMscore.cpp

The path to the TMscore executable is /path/to/TMscore/TMscore.
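As a minimal sketch of how this executable might be called from Python (PDFBench's own wrapper may differ), the snippet below runs TMscore on two PDB files and extracts the score from its text output. The helper names `parse_tm_score` and `run_tmscore` are hypothetical, and the regular expression assumes that TMscore prints a line of the form `TM-score    = 0.4567  (d0= 3.21)`.

```python
import re
import subprocess

# Assumed output format: a line such as "TM-score    = 0.4567  (d0= 3.21)".
TM_SCORE_PATTERN = re.compile(r"^TM-score\s*=\s*([0-9.]+)", re.MULTILINE)


def parse_tm_score(stdout: str) -> float:
    """Extract the TM-score value from TMscore's text output."""
    match = TM_SCORE_PATTERN.search(stdout)
    if match is None:
        raise ValueError("No TM-score line found in the output")
    return float(match.group(1))


def run_tmscore(executable: str, model_pdb: str, reference_pdb: str) -> float:
    """Run TMscore on two structure files and return the reported score."""
    result = subprocess.run(
        [executable, model_pdb, reference_pdb],
        capture_output=True,
        text=True,
        check=True,  # raise if TMscore exits with a non-zero status
    )
    return parse_tm_score(result.stdout)
```

A call would then look like `run_tmscore("/path/to/TMscore/TMscore", "designed.pdb", "reference.pdb")`, where both file names are placeholders.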

2.3. InterProScan

InterProScan requires Java 11.

cd /path/to/interproscan
wget http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.74.105/interproscan-5.74.105-64-bit.tar.gz
tar -zxvf ./interproscan-5.74.105-64-bit.tar.gz

The path to the InterProScan executable is /path/to/interproscan/interproscan-5.73-105-64/interproscan.sh. The directory name follows the version you downloaded, so adjust it if yours differs.
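Because InterProScan depends on Java 11, it can help to check the installed Java before running it. The sketch below is one way to do that from Python; the function names are hypothetical, and it assumes that `java` is on your PATH and prints a version string such as `openjdk version "11.0.20"`.

```python
import re
import subprocess


def parse_java_major(version_output: str) -> int:
    """Return the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("Could not find a Java version string")
    major = int(match.group(1))
    # Legacy scheme: Java 8 and earlier report themselves as "1.8.0_...".
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_major_version() -> int:
    """Query the `java` found on PATH; it prints its version to stderr."""
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr or result.stdout)
```

If `java_major_version()` returns a value below 11, install a newer Java before running interproscan.sh.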

2.4. MMseqs2 and its database

Download MMseqs2 following its tutorial. Afterwards, your directory should look something like this:

cd /path/to/mmseqs
tree . -L 2
├── bin
│   └── mmseqs  # Executable file of MMseqs2
├── examples
├── LICENSE.md
├── matrices
├── README.md
├── userguide.pdf
└── util

The path to the MMseqs2 executable is /path/to/mmseqs/bin/mmseqs.

cd /path/to/mmseqs
mkdir DB && cd DB

# Download UniProtKB/Swiss-Prot (~400 MB)
wget https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/uniprot_sprot.fasta.gz
gunzip uniprot_sprot.fasta.gz
# Download UniProtKB/TrEMBL (~100 GB)
wget https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/uniprot_trembl.fasta.gz
gunzip uniprot_trembl.fasta.gz

# Concatenate the two subsets into a single UniProtKB FASTA file
cat uniprot_sprot.fasta uniprot_trembl.fasta > uniprot.fasta

# Create the search database (CPU version)
mmseqs createdb uniprot.fasta ./uniprotdb
# Convert the search database to the GPU version
mmseqs makepaddedseqdb ./uniprotdb ./uniprotdb_gpu
# Create indexes to accelerate searching (takes tens of hours)
mmseqs createindex ./uniprotdb_gpu tmp --index-subset 2

The path to the MMseqs2 database is /path/to/mmseqs/DB/uniprotdb (CPU) or /path/to/mmseqs/DB/uniprotdb_gpu (GPU). Warning: the search database built from UniProtKB takes up about 500 GB of disk space and tens of hours to build. Without GPU acceleration, searching a single sequence takes nearly 1 hour.
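Given the roughly 500 GB footprint, a quick free-space check before downloading and building the database can save a failed run. The following is a minimal sketch using only the Python standard library; the function names and the default threshold are illustrative assumptions rather than part of PDFBench.

```python
import shutil


def free_space_gb(path: str) -> float:
    """Return the free space, in gigabytes, on the volume containing `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path: str, required_gb: float = 500.0) -> bool:
    """Check whether the volume has at least `required_gb` of free space."""
    return free_space_gb(path) >= required_gb
```

For example, `has_enough_space("/path/to/mmseqs")` reports whether the target volume can hold the database described above.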

2.5. ESMFold

Download the ESMFold weights from Hugging Face. The path to the ESMFold weights is /path/to/esmfold/weights/folder.

2.6. Modify your function-parser

We provide two parsers for Mol-Instructions and CAMEOTest in src/utils.py, as follows:

import re

"""
Keyword-guided Task
"""
source: str = "Generate a protein sequence for a novel protein that integrates the following function keywords: Cyt_c-like_dom. The designed protein sequence is "

def get_text_from_keywords(instruction: str) -> str:
    # Parse the keywords out of the instruction text.
    # str.removesuffix requires Python 3.9 or later.
    keyword = instruction.removesuffix("The designed protein sequence is ")
    # Drop the trailing ". ", then capture everything after the first colon.
    keyword = re.search(r":\s*(.*)", keyword[:-2]).group(1)
    return keyword.strip()

keywords: str = get_text_from_keywords(source)
# Only the keywords 'Cyt_c-like_dom' remain.

"""
Description-guided Task
"""
source: str = "Synthesize a protein sequence with the appropriate folding and stability properties for the desired function. 1. The protein should be able to modulate glycine decarboxylation via glycine cleavage system in a way that leads to a desirable outcome. The designed protein sequence is "

def get_text_from_description(instruction: str) -> str:
    # Parse the function description out of the instruction text.
    # Remove everything before the first numbered item ("1.").
    function = re.sub(r"^.*?(1\.)", r"\1", instruction)
    function = function.removesuffix("The designed protein sequence is ")
    return function.strip()

description: str = get_text_from_description(source)
# The additional prompt 'The designed protein sequence is ' is removed.

If your function descriptions or keywords are formatted differently, you must modify these two functions to obtain correct results for ProTrek Score, EvoLlama Score, and Retrieval Accuracy.
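As an illustration of such a modification, the sketch below adapts the keyword parser to a different prompt template. The template, the suffix `PROMPT_SUFFIX`, and the function name `get_text_from_custom_keywords` are all hypothetical; substitute the wording your own dataset uses. It requires Python 3.9 or later for `str.removesuffix`.

```python
import re

# Hypothetical trailing prompt, used only for illustration.
PROMPT_SUFFIX = "Answer with the sequence only."


def get_text_from_custom_keywords(instruction: str) -> str:
    """Extract the keyword list that follows the first colon."""
    text = instruction.strip().removesuffix(PROMPT_SUFFIX).strip()
    match = re.search(r":\s*(.*)", text)
    if match is None:
        raise ValueError("Instruction does not contain a keyword list")
    # Drop the sentence-final period left over from the template.
    return match.group(1).rstrip(".").strip()


example = (
    "Design a protein with these keywords: Kinase, ATP-binding. "
    "Answer with the sequence only."
)
custom_keywords = get_text_from_custom_keywords(example)
```

Raising an error when the pattern is missing, instead of returning an empty string, makes a template mismatch visible early, before it silently affects the scores.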

3. Prepare your evaluation data

We highly recommend organizing your evaluation data as we do; see ./example/data/example_data.json. Each record has the following fields:

instruction: Protein functions described in natural language
reference: Ground Truth protein sequence
response: Designed protein sequence
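The snippet below sketches how such a file might be assembled and checked in Python. It assumes the file is a JSON list of records with exactly these three fields; consult ./example/data/example_data.json for the authoritative layout. The helper `validate_records` is hypothetical, and the two sequences are made-up placeholders, not real proteins.

```python
import json

REQUIRED_FIELDS = ("instruction", "reference", "response")


def validate_records(records: list) -> None:
    """Raise ValueError if any record lacks one of the required fields."""
    for index, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"Record {index} is missing fields: {missing}")


records = [
    {
        "instruction": (
            "Generate a protein sequence for a novel protein that integrates "
            "the following function keywords: Cyt_c-like_dom. "
            "The designed protein sequence is "
        ),
        "reference": "MKTAYIAKQR",  # placeholder ground-truth sequence
        "response": "MKSAYLAKQK",  # placeholder designed sequence
    }
]

validate_records(records)
serialized = json.dumps(records, indent=2)  # write this string to your data file
```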

4. Let's Go Evaluation!

We provide two examples, one for single evaluation and one for batch evaluation. Following them, edit ./scripts/eval.sh to match your own setup.

zsh scripts/eval.sh

Note: the repository includes example result files; delete them before your first run.

Cite this work

@misc{kuang2025pdfbenchbenchmarknovoprotein,
      title={PDFBench: A Benchmark for De novo Protein Design from Function}, 
      author={Jiahao Kuang and Nuowei Liu and Changzhi Sun and Tao Ji and Yuanbin Wu},
      year={2025},
      eprint={2505.20346},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.20346}, 
}
