Batch Mode Support and Best Practices for Ligand Screening · Issue #31 · CBDD/openduck
Open
Jnelen opened this issue Apr 10, 2025 · 7 comments

Jnelen commented Apr 10, 2025

Hi! I’ve been exploring this repo and I think that the whole concept is super exciting, with applications ranging from initial hit finding (as a post-docking filter) to hit optimization (accurate ranking of hit compounds) as has been documented across various excellent papers. Really impressive work!

I am trying to use it myself, but I had a few questions:

Batch Mode / CPU Parallelization:

Is there current (or planned) support for batch processing of ligands with OpenMM — something akin to the built-in batch mode offered by the AMBER backend?

I mostly have access to CPU-heavy clusters (limited GPU availability) and would love to analyze a few hundred ligands efficiently — ideally with SLURM support or parallelization across CPUs.

As a workaround, I’m thinking of writing a launcher script that submits one SLURM job per ligand. If I make it flexible enough, maybe it could be a useful addition to this repo? I'd be happy to share a working prototype once I’ve tested it further. I already have a Singularity container that runs smoothly with the OpenMM backend, and I’m working through additional validation now.

If there is support for OpenMM batch mode already, the launcher script could still help by submitting jobs for each batch, making it practical for systems with multiple GPUs. For example, if you have 400 ligands and 4 GPUs, the script could split them into 4 jobs of 100 ligands each and run them in parallel across the GPUs.

Recommended Settings:

  • How many SMD cycles do you recommend for virtual screening vs more accurate evaluations? Across papers, documentation and the tutorial, I have seen 5 suggested for VS, and 10–20 for higher precision runs. Is that still your go-to?

  • What’s your take on using Hydrogen Mass Repartitioning (HMR)? From your experience, does it significantly affect result quality, or is it generally a safe way to speed things up?

Performance Expectations:

Any ballpark estimates for how long it typically takes to process a single ligand in your setup (CPU vs GPU)? Just trying to calibrate my expectations.

Thanks again for the great work on this — really looking forward to experimenting with it more. Would love to contribute back if there's interest in batch mode support or other usability improvements!

Kind regards,
Jochem

simonbray (Collaborator) commented
Hey @Jnelen,

Not sure if it is actively being worked on; perhaps @AlvaroSmorras can answer that.

If you want to add code for your own use, feel free. Whether your code gets merged in of course depends on whether the project remains actively maintained in the future.

How many SMD cycles do you recommend for virtual screening vs more accurate evaluations? Across papers, documentation and the tutorial, I have seen 5 suggested for VS, and 10–20 for higher precision runs. Is that still your go-to?

I use 8 for VS.

What’s your take on using Hydrogen Mass Repartitioning (HMR)? From your experience, does it significantly affect result quality, or is it generally a safe way to speed things up?

I have been advised to use it, but I haven't benchmarked it myself. The relevant paper is here: https://pubs.acs.org/doi/10.1021/ct5010406
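For intuition on what HMR actually does, here is a toy arithmetic sketch (not openduck or OpenMM code, and the `repartition` helper is invented for illustration): mass is shifted from each heavy atom onto its bonded hydrogens, conserving total mass while slowing the fastest X–H vibrations enough to permit a longer integration timestep.

```python
# Toy illustration of hydrogen mass repartitioning (HMR): move mass from each
# heavy atom to its bonded hydrogens so total system mass is conserved while
# the fastest X-H vibrations slow down, allowing a longer (~4 fs) timestep.
def repartition(masses, bonds, target_h_mass=3.024):
    """masses: atom -> amu; bonds: (heavy_atom, hydrogen) pairs."""
    new = dict(masses)
    for heavy, h in bonds:
        delta = target_h_mass - masses[h]  # extra mass given to this hydrogen
        new[h] = target_h_mass
        new[heavy] -= delta                # removed from its bonded heavy atom
    return new


# Methyl-like fragment: one carbon bonded to three hydrogens
masses = {"C": 12.011, "H1": 1.008, "H2": 1.008, "H3": 1.008}
bonds = [("C", h) for h in ("H1", "H2", "H3")]
hmr = repartition(masses, bonds)
```

Because only the mass distribution changes, not the potential energy surface, equilibrium properties are expected to be unaffected — which is the argument the linked paper makes in detail.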

Any ballpark estimates for how long it typically takes to process a single ligand in your setup (CPU vs GPU)?

In my experience, very roughly an hour on a single GPU for the complete 8 cycles, but most ligands terminate before completing all 8. Of course it depends a lot on the size of the protein, if you use chunking, HMR etc.

CPU-heavy clusters

Never tried it on CPU alone, but I guess it will be extremely slow.

I hope this helps.

Jnelen (Author) commented Apr 17, 2025


Thanks a lot for the information. I now have a first version that works quite well for launching SLURM jobs with the OpenMM backend. However, batching (multiple compounds per execution) doesn't currently seem to be supported for OpenMM, so I'll look into mimicking this behaviour from the Amber backend and implementing it for OpenMM. If I can get it to work, I think it could be a nice addition, but as you say, it's up to the main developers to decide whether to merge it!

AlvaroSmorras (Collaborator) commented
Hi @Jnelen
I'm sorry I didn't see this earlier.
Development has slowed a bit, but we are using the code continuously. It would be great if you could open a pull request whenever you feel the batch execution mode is ready for OpenMM. Initially I only implemented it for Amber, since we run the simulations as separate SLURM tasks while the preparation can be done locally. I guess for an OpenMM batch mode to work the way we currently do in Amber, it would need access to a complete node, which can sometimes be a problem for GPUs, but if you are only going to use CPUs it should be fine.

Regarding your questions, I believe @simonbray answered them. I'll only add that I have indeed studied the effect of using HMR, and it does not impact the results at all. Here is an example with two benchmark datasets I tried (Iridium in green, SERAPHiC in purple). With no significant impact on the results, it takes almost half the wall-clock time. Let's see if I find time to push the publication of the code and benchmarks soon.

[Figure: free energies with vs. without HMR for the Iridium (green) and SERAPHiC (purple) benchmark datasets]

I usually launch 10 replicas, but 8 is perfectly OK. The secret lies in having a good WQB threshold to stop the 'labile' binders early. With that, and HMR, they usually run at 800–1000 ns/day on an RTX 3080. I think CPU execution will be rather slow.
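The early-stopping idea could be sketched as below. This is a hypothetical skeleton: `run_smd_cycle` and the threshold value are placeholders, not openduck's API.

```python
# Sketch of WQB-threshold early stopping: after each steered-MD cycle, check
# the lowest quasi-bound work (WQB) seen so far; once it falls below the
# threshold, the ligand is deemed labile and the remaining cycles are skipped.
def screen_ligand(run_smd_cycle, n_cycles=8, wqb_threshold=6.0):
    """run_smd_cycle(i) -> WQB of cycle i (kcal/mol); returns (min WQB, cycles run)."""
    wqbs = []
    for cycle in range(n_cycles):
        wqbs.append(run_smd_cycle(cycle))  # one SMD pulling cycle
        if min(wqbs) < wqb_threshold:
            break                          # labile binder: stop early
    return min(wqbs), len(wqbs)


# Example with canned per-cycle WQB values (kcal/mol) standing in for real SMD
fake = iter([9.1, 8.4, 5.2, 7.0])
wqb, cycles_run = screen_ligand(lambda c: next(fake))
```

In this toy run the third cycle drops below the threshold, so only 3 of the 8 cycles execute — which is why most ligands in a screen finish well under the worst-case wall time.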

Jnelen (Author) commented Apr 23, 2025


Thanks for your input! I'll try to work on the openMM batching in the coming weeks as a side project. Would be wonderful to get it merged if I can finish it! Additionally, I think (hope) the launcher script can also be very convenient to run with openMM (Amber should also work), so hopefully that can also be a nice inclusion. I'll keep you updated and open a PR to review when it's ready!

simonbray (Collaborator) commented


Thanks for sharing this @AlvaroSmorras, really nice to see. Did you apply a WQB cutoff threshold running these simulations? I imagine this could change the results quite significantly.

AlvaroSmorras (Collaborator) commented Apr 29, 2025


@simonbray The values shown are the free energies calculated from the exponential average (Jarzynski) of the works, with the error bars coming from bootstrapping the WQBs. I ran 20 replicas without a threshold, just to have the same sampling for each point, but as a result some interactions have very high dispersion.
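The estimator described here fits in a few lines. This is a generic sketch with made-up work values, not the project's analysis code; the helper names are invented.

```python
# Jarzynski exponential-average free energy from per-replica work values,
# with a bootstrap standard error over replicas.
import math
import random


def jarzynski(works, kT=0.593):  # kT in kcal/mol at ~298 K
    """dG = -kT * ln( <exp(-W/kT)> ), averaged over replicas."""
    n = len(works)
    return -kT * math.log(sum(math.exp(-w / kT) for w in works) / n)


def bootstrap_error(works, n_boot=1000, seed=0):
    """Std. dev. of the Jarzynski estimate over bootstrap resamples."""
    rng = random.Random(seed)
    est = [jarzynski([rng.choice(works) for _ in works]) for _ in range(n_boot)]
    mean = sum(est) / n_boot
    return (sum((e - mean) ** 2 for e in est) / n_boot) ** 0.5


works = [6.2, 5.8, 7.1, 6.5, 5.9]  # illustrative per-replica work values (kcal/mol)
dG = jarzynski(works)
err = bootstrap_error(works)
```

Note that the exponential average is dominated by the lowest work values, so the estimate always lies between the minimum and the arithmetic mean of the works — which is also why high-dispersion interactions get large bootstrap error bars.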

Something that might be of interest is that I have also been testing steering at higher speeds, and it works with minimal differences too, so we could also optimize the simulations in that dimension. Still, I feel the equilibration is what takes longest (especially for the bad/labile ligands).

Jnelen (Author) commented May 22, 2025

Hi @AlvaroSmorras,
I’ve opened a pull request (#32) that adds batch support to the OpenMM backend, as we discussed.

As a bonus, I’ve also included a convenient SLURM job launcher utility, which should be especially useful for screening a larger number of ligands after initial docking.

Would be great to get your thoughts when you have a chance to review it. Looking forward to your feedback!
