Batch Mode Support and Best Practices for Ligand Screening · Issue #31 · CBDD/openduck
Open
Jnelen opened this issue Apr 10, 2025 · 7 comments

Jnelen commented Apr 10, 2025

Hi! I’ve been exploring this repo and I think that the whole concept is super exciting, with applications ranging from initial hit finding (as a post-docking filter) to hit optimization (accurate ranking of hit compounds) as has been documented across various excellent papers. Really impressive work!

I am trying to use it myself, but I had a few questions:

Batch Mode / CPU Parallelization:

Is there current (or planned) support for batch processing of ligands with OpenMM — something akin to the built-in batch mode offered by the AMBER backend?

I mostly have access to CPU-heavy clusters (limited GPU availability) and would love to analyze a few hundred ligands efficiently — ideally with SLURM support or parallelization across CPUs.

As a workaround, I’m thinking of writing a launcher script that submits one SLURM job per ligand. If I make it flexible enough, maybe it could be a useful addition to this repo? I'd be happy to share a working prototype once I’ve tested it further. I already have a Singularity container that runs smoothly with the OpenMM backend, and I’m working through additional validation now.

If there is support for OpenMM batch mode already, the launcher script could still help by submitting jobs for each batch, making it practical for systems with multiple GPUs. For example, if you have 400 ligands and 4 GPUs, the script could split them into 4 jobs of 100 ligands each and run them in parallel across the GPUs.

Recommended Settings:

  • How many SMD cycles do you recommend for virtual screening vs more accurate evaluations? Across papers, documentation and the tutorial, I have seen 5 suggested for VS, and 10–20 for higher precision runs. Is that still your go-to?

  • What’s your take on using Hydrogen Mass Repartitioning (HMR)? From your experience, does it significantly affect result quality, or is it generally a safe way to speed things up?

Performance Expectations:

Any ballpark estimates for how long it typically takes to process a single ligand in your setup (CPU vs GPU)? Just trying to calibrate my expectations.

Thanks again for the great work on this — really looking forward to experimenting with it more. Would love to contribute back if there's interest in batch mode support or other usability improvements!

Kind regards,
Jochem

simonbray (Collaborator) commented
Hey @Jnelen,

Not sure if it is actively being worked on; perhaps @AlvaroSmorras can answer that.

If you want to add code for your own use, feel free. Whether your code gets merged in of course depends on whether the project remains actively maintained in the future.

How many SMD cycles do you recommend for virtual screening vs more accurate evaluations? Across papers, documentation and the tutorial, I have seen 5 suggested for VS, and 10–20 for higher precision runs. Is that still your go-to?

I use 8 for VS.

What’s your take on using Hydrogen Mass Repartitioning (HMR)? From your experience, does it significantly affect result quality, or is it generally a safe way to speed things up?

I have been advised to use it, but I haven't benchmarked it myself. The relevant paper is here: https://pubs.acs.org/doi/10.1021/ct5010406
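For intuition on what HMR actually does, here is a toy arithmetic sketch (not openduck or OpenMM code, and the `repartition` helper is invented for illustration): mass is shifted from each heavy atom onto its bonded hydrogens, conserving total mass while slowing the fastest X–H vibrations enough to permit a longer integration timestep.

```python
# Toy illustration of hydrogen mass repartitioning (HMR): move mass from each
# heavy atom to its bonded hydrogens so total system mass is conserved while
# the fastest X-H vibrations slow down, allowing a longer (~4 fs) timestep.
def repartition(masses, bonds, target_h_mass=3.024):
    """masses: atom -> amu; bonds: (heavy_atom, hydrogen) pairs."""
    new = dict(masses)
    for heavy, h in bonds:
        delta = target_h_mass - masses[h]  # extra mass given to this hydrogen
        new[h] = target_h_mass
        new[heavy] -= delta                # removed from its bonded heavy atom
    return new


# Methyl-like fragment: one carbon bonded to three hydrogens
masses = {"C": 12.011, "H1": 1.008, "H2": 1.008, "H3": 1.008}
bonds = [("C", h) for h in ("H1", "H2", "H3")]
hmr = repartition(masses, bonds)
```

Because only the mass distribution changes, not the potential energy surface, equilibrium properties are expected to be unaffected — which is the argument the linked paper makes in detail.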

Any ballpark estimates for how long it typically takes to process a single ligand in your setup (CPU vs GPU)?

In my experience, very roughly an hour on a single GPU for the complete 8 cycles, but most ligands terminate before completing all 8. Of course it depends a lot on the size of the protein, if you use chunking, HMR etc.

CPU-heavy clusters

Never tried it on CPU alone, but I guess it will be extremely slow.

I hope this helps.

Jnelen (Author) commented Apr 17, 2025


Thanks a lot for the information. I now have a first version that works quite well for launching SLURM jobs with the OpenMM backend. However, batching (multiple compounds per execution) doesn't currently seem to be supported for OpenMM, so I'll look into mimicking this behaviour from the Amber backend and implementing it for OpenMM. If I can get it to work, I think it could be a nice addition, but as you say, it's up to the main developers to decide whether to merge it!

AlvaroSmorras (Collaborator) commented
Hi @Jnelen
I'm sorry I didn't see this earlier.
Development has slowed a bit, but we are using the code continuously. It would be great if you could open a pull request whenever you feel the batch execution mode is ready for OpenMM. Initially I only implemented it for Amber, since we run the simulations as separate SLURM tasks while the preparation can be done locally. I guess for an OpenMM batch mode to work the way we currently do in Amber, it would need access to a complete node, which can sometimes be a problem for GPUs, but if you are only going to use CPUs it should be fine.

Regarding your questions, I believe @simonbray answered them. I'll only add that I have indeed studied the effect of using HMR, and it does not impact the results at all. Here is an example with two benchmark datasets I tried (Iridium in green, SERAPHiC in purple). With no significant impact on the results, it takes almost half the wall-clock time. Let's see if I find time to push the publication of the code and benchmarks soon.

[Figure: free energies with vs. without HMR for the Iridium (green) and SERAPHiC (purple) benchmark datasets]

I usually launch 10 replicas, but 8 is perfectly OK. The secret lies in having a good WQB threshold to stop the 'labile' binders early. With that, and HMR, they usually run at 800–1000 ns/day on an RTX 3080. I think CPU execution will be rather slow.
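The early-stopping idea could be sketched as below. This is a hypothetical skeleton: `run_smd_cycle` and the threshold value are placeholders, not openduck's API.

```python
# Sketch of WQB-threshold early stopping: after each steered-MD cycle, check
# the lowest quasi-bound work (WQB) seen so far; once it falls below the
# threshold, the ligand is deemed labile and the remaining cycles are skipped.
def screen_ligand(run_smd_cycle, n_cycles=8, wqb_threshold=6.0):
    """run_smd_cycle(i) -> WQB of cycle i (kcal/mol); returns (min WQB, cycles run)."""
    wqbs = []
    for cycle in range(n_cycles):
        wqbs.append(run_smd_cycle(cycle))  # one SMD pulling cycle
        if min(wqbs) < wqb_threshold:
            break                          # labile binder: stop early
    return min(wqbs), len(wqbs)


# Example with canned per-cycle WQB values (kcal/mol) standing in for real SMD
fake = iter([9.1, 8.4, 5.2, 7.0])
wqb, cycles_run = screen_ligand(lambda c: next(fake))
```

In this toy run the third cycle drops below the threshold, so only 3 of the 8 cycles execute — which is why most ligands in a screen finish well under the worst-case wall time.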

Jnelen (Author) commented Apr 23, 2025


Thanks for your input! I'll try to work on the openMM batching in the coming weeks as a side project. Would be wonderful to get it merged if I can finish it! Additionally, I think (hope) the launcher script can also be very convenient to run with openMM (Amber should also work), so hopefully that can also be a nice inclusion. I'll keep you updated and open a PR to review when it's ready!

simonbray (Collaborator) commented


Thanks for sharing this @AlvaroSmorras, really nice to see. Did you apply a WQB cutoff threshold running these simulations? I imagine this could change the results quite significantly.

AlvaroSmorras (Collaborator) commented Apr 29, 2025


@simonbray The values shown are the free energies calculated from the exponential average (Jarzynski) of the works, with the error bars coming from bootstrapping the WQBs. I ran 20 replicas without a threshold, just to have the same sampling for each point, but as a result some interactions have very high dispersion.
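The estimator described here fits in a few lines. This is a generic sketch with made-up work values, not the project's analysis code; the helper names are invented.

```python
# Jarzynski exponential-average free energy from per-replica work values,
# with a bootstrap standard error over replicas.
import math
import random


def jarzynski(works, kT=0.593):  # kT in kcal/mol at ~298 K
    """dG = -kT * ln( <exp(-W/kT)> ), averaged over replicas."""
    n = len(works)
    return -kT * math.log(sum(math.exp(-w / kT) for w in works) / n)


def bootstrap_error(works, n_boot=1000, seed=0):
    """Std. dev. of the Jarzynski estimate over bootstrap resamples."""
    rng = random.Random(seed)
    est = [jarzynski([rng.choice(works) for _ in works]) for _ in range(n_boot)]
    mean = sum(est) / n_boot
    return (sum((e - mean) ** 2 for e in est) / n_boot) ** 0.5


works = [6.2, 5.8, 7.1, 6.5, 5.9]  # illustrative per-replica work values (kcal/mol)
dG = jarzynski(works)
err = bootstrap_error(works)
```

Note that the exponential average is dominated by the lowest work values, so the estimate always lies between the minimum and the arithmetic mean of the works — which is also why high-dispersion interactions get large bootstrap error bars.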

Something that might be of interest is that I have also been testing steering at higher speeds, and it works with minimal differences too, so we could also optimize the simulations in that dimension. Still, I feel the equilibration is what takes longest (especially for the bad/labile ligands).

Jnelen (Author) commented May 22, 2025

Hi @AlvaroSmorras,
I’ve opened a pull request (#32) that adds batch support to the OpenMM backend, as we discussed.

As a bonus, I’ve also included a convenient SLURM job launcher utility, which should be especially useful for screening a larger number of ligands after initial docking.

Would be great to get your thoughts when you have a chance to review it. Looking forward to your feedback!
