Adding bond types & orders to standard amino acid bonds and cys-cys bonds #3770

RiesBen · 2022-09-08T13:55:34Z

Hi openmm,
over at openfe/gufe, we wanted to use the PDBFile parser for reading in protein pdbs here! :) But we want to use gufe with openff and therefore need to have the bond-orders. This is currently not possible but easily doable with minor changes to your implementation (as nearly everything is already there). I connected the loose ends and want to suggest these small, non-invasive changes to the topology class in openmm.

Changes:

add bond-Type and bond-Order to openmm/wrappers/python/openmm/app/data/residues.xml
change topology generation, such that the default 'None' is overwritten by the value in residues.xml for std amino acid residues and nucleic acids / and add it to disulfid bonds - openmm/wrappers/python/openmm/app/topology.py

Please let me know what you think! :)

peastman · 2022-09-08T15:28:37Z

Thanks! That looks really interesting.

How did you generate the updated residues.xml? Did you fill in the bond types by hand, or did you generate them programmatically? Either way, where did the values come from?

One concern I have is that setting bond types from PDB files will always be inconsistent. Standard bonds will be marked correctly, but any nonstandard ones from CONECT records (anything other than proteins, nucleic acids, and water, as well as covalent modifications to proteins) won't be. I worry that people will see bond types being present, assume that means all bonds have specified types, and write code like this.

for bond in topology.bonds():
    if bond.type == 'Single':
        # Do something that's supposed to happen for every single bond, but actually won't.

Does anyone have thoughts on this?

richardjgowers · 2022-09-08T16:40:08Z

@peastman my best answer would be to have them have bond.order = None (or 0 etc) and bond.type = None / 'Unknown', but I agree that this doesn't stop people making bad decisions. In general getting bond orders right on nonstandard parts is something that has to happen after this step.

j-wags · 2022-09-08T17:27:01Z

A few points from the OpenFF side:

It's probably not necessary to store Single vs Aromatic - The aromaticity of a bond can be determined by looking at the complete chemical graph (this is how it's done in OpenFF, and the user-set bond aromaticity gets readily overwritten).

my best answer would be to have them have bond.order = None (or 0 etc)

This seems like a good idea - It may be better long-term to ensure bond.order is an int. Maybe -1 could be a good value to disambiguate "unknown" (bond.order=-1) from "metal chelating/dative(?)" (bond.order=0) bonds.
I'd like to advocate for the addition of formal charge in here - I'm looking for a concrete example but something like a resonance-stabilized sulfur radical may make it hard to unambiguously determine the total charge of a molecule from bond orders alone. I could be wrong here but am having trouble finding a fundamental explanation either way.
OpenFF uses a tool that ingests a mmCIF file (we use aa_variants.cif from the RCSB) and produces a list of chemical substructures (including variants where "leaving atoms" have left). I'd be happy to work with someone on repurposing this to make a residues.xml file.

…/github.com/RiesBen/openmm into Adding_bondTypes_Orders_to_stdBonds_CCBonds

RiesBen · 2022-09-09T14:42:50Z

Hi all,
I added a few lines to deal the different protonation patterns of Histidine, such the standard bond order is adapted:

@peastman I generated the .xml file semi-automatically based on Lehninger Principles of Biochemistry (text-book)

@j-wags @richardjgowers I like the idea of having None bond-order for undefined and maybe "Undefined" for bond-type?

@j-wags if formal charges should be added, I think much more of the code needs to be significantly modified.I would like to suggest to split this into a second PR?

peastman · 2022-09-09T17:50:00Z

I added a few lines to deal the different protonation patterns of Histidine, such the standard bond order is adapted:

What about other residues that have multiple protonation states? Do they need special treatment too?

I like the idea of having None bond-order for undefined and maybe "Undefined" for bond-type?

That's how it works right now. The difference is that currently all of them are undefined, because we don't read any file formats that provide bond orders. There are two possible problems we need to consider:

Most bonds have defined orders, but a few are undefined, leading users to make mistakes.
Some bond orders are incorrect.

Both of these are concerns, but of course the second one is a much bigger concern. As you observed, simply adding or removing a hydrogen can change nearby bonds. Same with any covalent modification. How confident are we that any bond orders that get defined will be correct?

For what it's worth, tools like RdKit and OpenBabel try to automatically infer bond orders and formal charges, and in my experience neither one of them is reliable.

jchodera · 2022-09-13T16:51:30Z

Just commenting regarding another use case that could potentially be enabled here: If we can get to the point where we have {atom elements and formal charges, and bonds with Kekulized integral bond orders} will be represented in Topology after loading and prepping a PDB structure, this will be fantastic, since we could then have espaloma also parameterize biopolymers in openmmforcefields in the short term (e.g. proteins with post-translational modificaitons or covalent ligands, complex heterogeneous systems, etc).

8000

peastman · 2022-09-13T18:24:08Z

For reference, here's the code OpenBabel uses to infer bond orders. It uses a lot of geometric criteria to identify special cases.

RDKit's PDB reader uses a method more like the one here, with a table of which bonds in standard residues are double bonds. Though the code is a bit horrifying to look at. And of course, that only works for standard residues, nothing else.

j-wags · 2022-09-13T21:14:51Z

I'm going to meet with @RiesBen and @richardjgowers tomorrow to chat more about OpenFF's and OpenFE's needs regarding PDB loading, and whether it would be best for the code to live in OpenMM or elsewhere. But based on a quick discussion today, it seems like @richardjgowers and I are in agreement that formal charge is needed to unambiguously identify a chemical species. I think my confusion was around whether this PR was intended to support complete chemical identification when loading from PDB.

My major concern in this discussion was that, if we wanted this PR to enable identifying chemical species from PDB files, we'd need to implement formal charge assignment as well, and implementing atomic formal charge in OpenMM's Topology class would be a significant API/behavior change. So my concerns about this PR were along the lines of "if Peter is as resistant to behavior changes as I am, and we know we'll eventually want formal charge assigners as well, then it should all go in one PR".

@RiesBen and @richardjgowers' motivation, if I understand this correctly, is "we've already got all of the bond order assignment stuff lined up in a fork that we're using, we may as well upstream it since it's not an API change". This also makes sense - OpenMM Topologies already have a field for bond order+type information, so there's little harm in filling it in.

So, if we're in agreement that the goal of this PR isn't "fill in complete chemical information for molecules loaded from PDB", then my major concern about this functionality is resolved!

peastman · 2022-09-13T21:32:56Z

That seems fine. My main concern at this point is just about correctness. How confident are we that the bond information it fills in will always be accurate, including for different protonation states, different types of chain termination, and proteins with post-translational modifications?

jchodera · 2022-09-13T23:16:56Z

The correct long-term solution here---which OpenFF has already started to roll out---is to use the Chemical Components Dictionary as the canonical reference of residues described in the PDB, since the formal charges and bond orders there for all acceptable variants in the PDB are provided and validated.

peastman · 2022-09-13T23:32:44Z

It's correct as long as 1) your file only contains things that have been submitted to the PDB, and 2) it refers to everything with the correct names. In practice, a lot of PDB files don't meet those requirements.

jchodera · 2022-09-13T23:42:12Z

It's correct as long as 1) your file only contains things that have been submitted to the PDB, and 2) it refers to everything with the correct names. In practice, a lot of PDB files don't meet those requirements.

(1) PDB and mmCIF formats were developed to represent things that came from the PDB. A different format is really necessary for specifying residues or molecules not contained in the PDB---for the exact reason that the PDB does not provide sufficient information to define the chemistry. Their use for representing other things is a misuse and something to provide "best effort" support for, but not something that should preclude supporting their primary intended purpose.

(2) The PDB enforces this naming convention, but you could always use the molecular graph to match residues. The difficulty is that you would need to know how to transform the complete molecules in the CCD into common residues of those molecules (which are only parts of molecules)---a problem that the OpenFF folks have likely examined in much more detail.

peastman · 2022-09-14T00:14:18Z

That may have been the original intent, but it's not how they're actually used. PDB files are very widely used as an interchange format. They often contain things that aren't in the CCD, and they almost never follow standard naming for anything that isn't a standard residue.
This assumes you're able to identify all the bonds, which frequently is not the case. It comes back to the issue that most PDB files aren't from RCSB and don't follow the spec.

Even if you assume you're dealing with a file straight from RCSB, there generally isn't enough information to determine what you want. That's because they usually don't have hydrogens, which makes protonation states ambiguous. That in turn means that bond orders and formal charges are ambiguous. It's impossible to assign them when you load the file. You have to wait until hydrogens are added.

RiesBen · 2022-09-14T08:26:05Z

@peastman
So from a biochemical point of view, if we only limit ourselves to what the residues.xml file provides right now, even though there are protonations (that are not covered by the atom sets in residues.xml), the bond orders will stay correct:

regarding PTMs, the standard stuff like phosphorylation, sulfatization, glycolization, etc. should not break the current implementation, if we would estimate the bond-orders/add them to the residue.xml, as they should be well behaved.

Example(here serine without phosphate would have a -1 fc, with it would be neutral):

(credit goes to wikipedia)

Alternative problems may be coming from different oxidation states of many-electron metals (Fe, etc.).

@peastman do you have a specific case in mind, that concerns you? :)

peastman · 2022-09-14T19:35:29Z

@peastman do you have a specific case in mind, that concerns you? :)

I worry about the cases I know about, but also the ones I don't know about! :)

This feature will definitely need a good set of tests. As a starting point, can we verify that the following cases are all handled correctly?

For all the standard protonation variants we know about, load them from a PDB file and check that the bond orders get assigned correctly.
For all the same variants, load a PDB file without hydrogens, call Modeller.addHydrogens() to add them, and verify the bond orders end up correct.

We should check all the variants supported by addHydrogens(), and also a few ways of terminating protein and nucleic acid chains. (Nucleic acids especially can be terminated in a few different ways that are all common in the PDB.)

RiesBen · 2022-09-23T17:24:56Z

Regarding the PTMs this is a cool collection of 'frequent' PTMs. (I guess in a lot of modelling approaches they are still rare)

(credit goes to: https://pubs.rsc.org/en/content/articlelanding/2011/MB/C0MB00216J#!divRelatedContent&articles)

I think besides the ubiquitin, they could be easily realized with templates like the residue.xml already contains (either recognize them separate from amino acids (AA) or as modified AAs - only a couple of AAs are affected), consequently the current implementation would not need to change.

hm I'm off now for some time (3 Weeks), but we can discuss the next steps, when I'm back. :)

aizvorski · 2022-10-19T20:14:20Z

@RiesBen @peastman I just wanted to say this is an incredibly useful PR. Can we merge this in before the next release?

peastman · 2022-10-19T22:39:40Z

There's a lot of design and testing still needed first. See the discussion above, for example #3770 (comment).

aizvorski · 2022-10-20T04:33:46Z

@peastman I made a pdb with every aminoacid variant, sequence is GLY-ASH-ASP-CYS-CYX-GLH-GLU-HID-HIE-HIP-HIN-LYN-LYS. Also, to check different caps, GLY-GLY and ACE-GLY-GLY-NME. These are made in pymol.

Unfortunately these don't load into openmm 7.7 out of the box, even without the changes in this PR.

Code:

import openmm as mm
from openmm import app
protein_filename = 'variants.pdb'
pdb = app.PDBFile(protein_filename)
forcefield = app.ForceField("amber14-all.xml", "amber14/tip3pfb.xml")
system = forcefield.createSystem(
    pdb.topology,
    nonbondedMethod=app.NoCutoff,
)

Result:
variants.pdb - "ValueError: No template found for residue 1 (GLY). The set of atoms matches NGLY, but the bonds are different"
glygly.pdb - "ValueError: No template found for residue 1 (GLY). The set of atoms matches NGLY, but the bonds are different"
aceglyglynme.pdb - "ValueError: No template found for residue 2 (GLY). The set of atoms matches GLY, but the bonds are different"

The pdbs seem fine to me by looking through them. Could you help figure out what it is that openmm doesn't like about them?

(Possible clue: pdb4amber -i glygly.pdb -o glygly_modified.pdb --add-missing-atoms --nohyd produces a file which does load, the only changes seem to be the names and locations of the hydrogens. Doing the same to variants.pdb also loads, but changes a bunch of the protonation states)

variants.pdb.txt
glygly.pdb.txt
aceglyglynme.pdb.txt

aizvorski · 2022-10-20T04:58:39Z

@peastman Once we do get variants.pdb loading, how can we check the bond orders and types? We could inspect those manually and set up a regression test. Alternatively, we could compare with the bond orders when the same pdb is read in by oechem.

richardjgowers · 2022-10-20T13:30:16Z

@aizvorski I think a better gold set would be the PDB Chemical Components Dictionary: https://www.wwpdb.org/data/ccd

peastman · 2022-10-20T16:07:53Z

Your file uses nonstandard naming. See https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#template for a discussion of what causes this error, and why naming matters.

I think a better gold set would be the PDB Chemical Components Dictionary

Agreed, that seems like a good reference.

aizvorski · 2022-10-20T19:00:58Z

@aizvorski I think a better gold set would be the PDB Chemical Components Dictionary: https://www.wwpdb.org/data/ccd

@richardjgowers Good reference, thanks. Do you mean the "Protonation Variants Companion Dictionary" within that? The big Chemical Component Dictionary contains a whole bunch of ligands which is probably overkill, although it also includes common modified residues like PTR and TPO.

I had a peek and it includes both aa in the normal protonation state and a wide variety of other states, including many which OpenMM wouldn't handle currently (and also for which I think there may not be Amber forcefield parameters). Fully supporting all of those seems like a tall order, and a much bigger feature project than just adding bond orders.

There are bond orders in there, so it actually seems feasible to write a converter from https://files.wwpdb.org/pub/pdb/data/monomers/aa-variants-v1.cif to openmm/app/data/residues.xml

aizvorski · 2022-10-20T19:17:48Z

Your file uses nonstandard naming

Hmm, it seems so, even though it comes straight from Pymol (Pymol bug?). I ran it through pdbfixer which added extra hydrogens, ie both H02 and H2. Manually fixing the names in glygly.pdb works though.

The error message is wrong/confusing: "The set of atoms matches NGLY, but the bonds are different" - that doesn't say anything about nonstandard atom names, or which atoms have the nonstandard names, or how to fix the problem. Why does the code think the bonds are different, as opposed to atom names? Which bonds?

peastman · 2022-10-20T19:29:29Z

Fully supporting all of those seems like a tall order, and a much bigger feature project than just adding bond orders.

At least initially, I think it's sufficient to only support the variants used by addHydrogens().

Why does the code thing the bonds are different, as opposed to atom names? Which bonds?

See https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#template for an explanation of what's going on.

RiesBen added 3 commits September 8, 2022 10:40

updating standard resdiue xml with type and orders

fe76a75

adding type and order to topology loading/parsing

9c2e8c5

Delete residues_orig.xml

b30c16e

RiesBen mentioned this pull request Sep 8, 2022

Protein serialization OpenFreeEnergy/gufe#45

Merged

RiesBen added 2 commits September 9, 2022 11:37

Adding histidine protonation dependend bond order correction

2d2b982

Merge branch 'Adding_bondTypes_Orders_to_stdBonds_CCBonds' of https:/…

106118f

…/github.com/RiesBen/openmm into Adding_bondTypes_Orders_to_stdBonds_CCBonds

corrections for atomTypes

8097249

jchodera mentioned this pull request Sep 25, 2022

Add molecules that provide coverage of post-translational modifications of proteins openmm/spice-dataset#38

Open

jchodera mentioned this pull request Dec 8, 2022

unique_molecules argument should not be optional openforcefield/openff-toolkit#1485

Open

rjdirisio mentioned this pull request Mar 11, 2025

Store explicit bonds for heteroatom residues in PDBxFiles #4844

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Adding bond types & orders to standard amino acid bonds and cys-cys bonds #3770

Adding bond types & orders to standard amino acid bonds and cys-cys bonds #3770

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Adding bond types & orders to standard amino acid bonds and cys-cys bonds #3770

Are you sure you want to change the base?

Adding bond types & orders to standard amino acid bonds and cys-cys bonds #3770

Uh oh!

Conversation

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!