8000 Adding bond types & orders to standard amino acid bonds and cys-cys bonds by RiesBen · Pull Request #3770 · openmm/openmm · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Adding bond types & orders to standard amino acid bonds and cys-cys bonds #3770

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

RiesBen
Copy link
@RiesBen RiesBen commented Sep 8, 2022

Hi openmm,
over at openfe/gufe, we wanted to use the PDBFile parser for reading in protein pdbs here! :) But we want to use gufe with openff and therefore need to have the bond-orders. This is currently not possible but easily doable with minor changes to your implementation (as nearly everything is already there). I connected the loose ends and want to suggest these small, non-invasive changes to the topology class in openmm.

Changes:

  • add bond-Type and bond-Order to openmm/wrappers/python/openmm/app/data/residues.xml
  • change topology generation, such that the default 'None' is overwritten by the value in residues.xml for std amino acid residues and nucleic acids / and add it to disulfid bonds - openmm/wrappers/python/openmm/app/topology.py

Please let me know what you think! :)

@peastman
Copy link
Member
peastman commented Sep 8, 2022

Thanks! That looks really interesting.

How did you generate the updated residues.xml? Did you fill in the bond types by hand, or did you generate them programmatically? Either way, where did the values come from?

One concern I have is that setting bond types from PDB files will always be inconsistent. Standard bonds will be marked correctly, but any nonstandard ones from CONECT records (anything other than proteins, nucleic acids, and water, as well as covalent modifications to proteins) won't be. I worry that people will see bond types being present, assume that means all bonds have specified types, and write code like this.

for bond in topology.bonds():
    if bond.type == 'Single':
        # Do something that's supposed to happen for every single bond, but actually won't.

Does anyone have thoughts on this?

@richardjgowers
Copy link
Contributor

@peastman my best answer would be to have them have bond.order = None (or 0 etc) and bond.type = None / 'Unknown', but I agree that this doesn't stop people making bad decisions. In general getting bond orders right on nonstandard parts is something that has to happen after this step.

@j-wags
Copy link
Contributor
j-wags commented Sep 8, 2022

A few points from the OpenFF side:

  • It's probably not necessary to store Single vs Aromatic - The aromaticity of a bond can be determined by looking at the complete chemical graph (this is how it's done in OpenFF, and the user-set bond aromaticity gets readily overwritten).

my best answer would be to have them have bond.order = None (or 0 etc)

  • This seems like a good idea - It may be better long-term to ensure bond.order is an int. Maybe -1 could be a good value to disambiguate "unknown" (bond.order=-1) from "metal chelating/dative(?)" (bond.order=0) bonds.
  • I'd like to advocate for the addition of formal charge in here - I'm looking for a concrete example but something like a resonance-stabilized sulfur radical may make it hard to unambiguously determine the total charge of a molecule from bond orders alone. I could be wrong here but am having trouble finding a fundamental explanation either way.
  • OpenFF uses a tool that ingests a mmCIF file (we use aa_variants.cif from the RCSB) and produces a list of chemical substructures (including variants where "leaving atoms" have left). I'd be happy to work with someone on repurposing this to make a residues.xml file.

@RiesBen
Copy link
Author
RiesBen commented Sep 9, 2022

Hi all,
I added a few lines to deal the different protonation patterns of Histidine, such the standard bond order is adapted:

@peastman I generated the .xml file semi-automatically based on Lehninger Principles of Biochemistry (text-book)

@j-wags @richardjgowers I like the idea of having None bond-order for undefined and maybe "Undefined" for bond-type?

@j-wags if formal charges should be added, I think much more of the code needs to be significantly modified.I would like to suggest to split this into a second PR?

@peastman
Copy link
Member
peastman commented Sep 9, 2022

I added a few lines to deal the different protonation patterns of Histidine, such the standard bond order is adapted:

What about other residues that have multiple protonation states? Do they need special treatment too?

I like the idea of having None bond-order for undefined and maybe "Undefined" for bond-type?

That's how it works right now. The difference is that currently all of them are undefined, because we don't read any file formats that provide bond orders. There are two possible problems we need to consider:

  1. Most bonds have defined orders, but a few are undefined, leading users to make mistakes.
  2. Some bond orders are incorrect.

Both of these are concerns, but of course the second one is a much bigger concern. As you observed, simply adding or removing a hydrogen can change nearby bonds. Same with any covalent modification. How confident are we that any bond orders that get defined will be correct?

For what it's worth, tools like RdKit and OpenBabel try to automatically infer bond orders and formal charges, and in my experience neither one of them is reliable.

@jchodera
Copy link
Member

Just commenting regarding another use case that could potentially be enabled here: If we can get to the point where we have {atom elements and formal charges, and bonds with Kekulized integral bond orders} will be represented in Topology after loading and prepping a PDB structure, this will be fantastic, since we could then have espaloma also parameterize biopolymers in openmmforcefields in the short term (e.g. proteins with post-translational modificaitons or covalent ligands, complex heterogeneous systems, etc).

8000

@peastman
Copy link
Member

For reference, here's the code OpenBabel uses to infer bond orders. It uses a lot of geometric criteria to identify special cases.

RDKit's PDB reader uses a method more like the one here, with a table of which bonds in standard residues are double bonds. Though the code is a bit horrifying to look at. And of course, that only works for standard residues, nothing else.

@j-wags
Copy link
Contributor
j-wags commented Sep 13, 2022

I'm going to meet with @RiesBen and @richardjgowers tomorrow to chat more about OpenFF's and OpenFE's needs regarding PDB loading, and whether it would be best for the code to live in OpenMM or elsewhere. But based on a quick discussion today, it seems like @richardjgowers and I are in agreement that formal charge is needed to unambiguously identify a chemical species. I think my confusion was around whether this PR was intended to support complete chemical identification when loading from PDB.

My major concern in this discussion was that, if we wanted this PR to enable identifying chemical species from PDB files, we'd need to implement formal charge assignment as well, and implementing atomic formal charge in OpenMM's Topology class would be a significant API/behavior change. So my concerns about this PR were along the lines of "if Peter is as resistant to behavior changes as I am, and we know we'll eventually want formal charge assigners as well, then it should all go in one PR".

@RiesBen and @richardjgowers' motivation, if I understand this correctly, is "we've already got all of the bond order assignment stuff lined up in a fork that we're using, we may as well upstream it since it's not an API change". This also makes sense - OpenMM Topologies already have a field for bond order+type information, so there's little harm in filling it in.

So, if we're in agreement that the goal of this PR isn't "fill in complete chemical information for molecules loaded from PDB", then my major concern about this functionality is resolved!

@peastman
Copy link
Member

That seems fine. My main concern at this point is just about correctness. How confident are we that the bond information it fills in will always be accurate, including for different protonation states, different types of chain termination, and proteins with post-translational modifications?

@jchodera
Copy link
Member

The correct long-term solution here---which OpenFF has already started to roll out---is to use the Chemical Components Dictionary as the canonical reference of residues described in the PDB, since the formal charges and bond orders there for all acceptable variants in the PDB are provided and validated.

@peastman
Copy link
Member

It's correct as long as 1) your file only contains things that have been submitted to the PDB, and 2) it refers to everything with the correct names. In practice, a lot of PDB files don't meet those requirements.

@jchodera
Copy link
Member

It's correct as long as 1) your file only contains things that have been submitted to the PDB, and 2) it refers to everything with the correct names. In practice, a lot of PDB files don't meet those requirements.

(1) PDB and mmCIF formats were developed to represent things that came from the PDB. A different format is really necessary for specifying residues or molecules not contained in the PDB---for the exact reason that the PDB does not provide sufficient information to define the chemistry. Their use for representing other things is a misuse and something to provide "best effort" support for, but not something that should preclude supporting their primary intended purpose.

(2) The PDB enforces this naming convention, but you could always use the molecular graph to match residues. The difficulty is that you would need to know how to transform the complete molecules in the CCD into common residues of those molecules (which are only parts of molecules)---a problem that the OpenFF folks have likely examined in much more detail.

@peastman
Copy link
Member
  1. That may have been the original intent, but it's not how they're actually used. PDB files are very widely used as an interchange format. They often contain things that aren't in the CCD, and they almost never follow standard naming for anything that isn't a standard residue.
  2. This assumes you're able to identify all the bonds, which frequently is not the case. It comes back to the issue that most PDB files aren't from RCSB and don't follow the spec.

Even if you assume you're dealing with a file straight from RCSB, there generally isn't enough information to determine what you want. That's because they usually don't have hydrogens, which makes protonation states ambiguous. That in turn means that bond orders and formal charges are ambiguous. It's impossible to assign them when you load the file. You have to wait until hydrogens are added.

@RiesBen
Copy link
Author
RiesBen commented Sep 14, 2022

@peastman
So from a biochemical point of view, if we only limit ourselves to what the residues.xml file provides right now, even though there are protonations (that are not covered by the atom sets in residues.xml), the bond orders will stay correct:
AA_protonations_exceptHis

regarding PTMs, the standard stuff like phosphorylation, sulfatization, glycolization, etc. should not break the current implementation, if we would estimate the bond-orders/add them to the residue.xml, as they should be well behaved.

Example(here serine without phosphate would have a -1 fc, with it would be neutral):

(credit goes to wikipedia)

Alternative problems may be coming from different oxidation states of many-electron metals (Fe, etc.).

@peastman do you have a specific case in mind, that concerns you? :)

@peastman
Copy link
Member

@peastman do you have a specific case in mind, that concerns you? :)

I worry about the cases I know about, but also the ones I don't know about! :)

This feature will definitely need a good set of tests. As a starting point, can we verify that the following cases are all handled correctly?

  • For all the standard protonation variants we know about, load them from a PDB file and check that the bond orders get assigned correctly.
  • For all the same variants, load a PDB file without hydrogens, call Modeller.addHydrogens() to add them, and verify the bond orders end up correct.

We should check all the variants supported by addHydrogens(), and also a few ways of terminating protein and nucleic acid chains. (Nucleic acids especially can be terminated in a few different ways that are all common in the PDB.)

@RiesBen
Copy link
Author
RiesBen commented Sep 23, 2022

Regarding the PTMs this is a cool collection of 'frequent' PTMs. (I guess in a lot of modelling approaches they are still rare)
Representative-posttranslational-modifications-on-proteins-The-protein-complex-shown-in_W640
(credit goes to: https://pubs.rsc.org/en/content/articlelanding/2011/MB/C0MB00216J#!divRelatedContent&articles)

I think besides the ubiquitin, they could be easily realized with templates like the residue.xml already contains (either recognize them separate from amino acids (AA) or as modified AAs - only a couple of AAs are affected), consequently the current implementation would not need to change.

hm I'm off now for some time (3 Weeks), but we can discuss the next steps, when I'm back. :)

@aizvorski
Copy link
Contributor

@RiesBen @peastman I just wanted to say this is an incredibly useful PR. Can we merge this in before the next release?

@peastman
Copy link
Member

There's a lot of design and testing still needed first. See the discussion above, for example #3770 (comment).

@aizvorski
Copy link
Contributor
aizvorski commented Oct 20, 2022

@peastman I made a pdb with every aminoacid variant, sequence is GLY-ASH-ASP-CYS-CYX-GLH-GLU-HID-HIE-HIP-HIN-LYN-LYS. Also, to check different caps, GLY-GLY and ACE-GLY-GLY-NME. These are made in pymol.

Unfortunately these don't load into openmm 7.7 out of the box, even without the changes in this PR.

Code:

import openmm as mm
from openmm import app
protein_filename = 'variants.pdb'
pdb = app.PDBFile(protein_filename)
forcefield = app.ForceField("amber14-all.xml", "amber14/tip3pfb.xml")
system = forcefield.createSystem(
    pdb.topology,
    nonbondedMethod=app.NoCutoff,
)

Result:
variants.pdb - "ValueError: No template found for residue 1 (GLY). The set of atoms matches NGLY, but the bonds are different"
glygly.pdb - "ValueError: No template found for residue 1 (GLY). The set of atoms matches NGLY, but the bonds are different"
aceglyglynme.pdb - "ValueError: No template found for residue 2 (GLY). The set of atoms matches GLY, but the bonds are different"

The pdbs seem fine to me by looking through them. Could you help figure out what it is that openmm doesn't like about them?

(Possible clue: pdb4amber -i glygly.pdb -o glygly_modified.pdb --add-missing-atoms --nohyd produces a file which does load, the only changes seem to be the names and locations of the hydrogens. Doing the same to variants.pdb also loads, but changes a bunch of the protonation states)

variants.pdb.txt
glygly.pdb.txt
aceglyglynme.pdb.txt

@aizvorski
Copy link
Contributor

@peastman Once we do get variants.pdb loading, how can we check the bond orders and types? We could inspect those manually and set up a regression test. Alternatively, we could compare with the bond orders when the same pdb is read in by oechem.

@richardjgowers
Copy link
Contributor

@aizvorski I think a better gold set would be the PDB Chemical Components Dictionary: https://www.wwpdb.org/data/ccd

@peastman
Copy link
Member

Your file uses nonstandard naming. See https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#template for a discussion of what causes this error, and why naming matters.

I think a better gold set would be the PDB Chemical Components Dictionary

Agreed, that seems like a good reference.

@aizvorski
Copy link
Contributor
aizvorski commented Oct 20, 2022

@aizvorski I think a better gold set would be the PDB Chemical Components Dictionary: https://www.wwpdb.org/data/ccd

@richardjgowers Good reference, thanks. Do you mean the "Protonation Variants Companion Dictionary" within that? The big Chemical Component Dictionary contains a whole bunch of ligands which is probably overkill, although it also includes common modified residues like PTR and TPO.

I had a peek and it includes both aa in the normal protonation state and a wide variety of other states, including many which OpenMM wouldn't handle currently (and also for which I think there may not be Amber forcefield parameters). Fully supporting all of those seems like a tall order, and a much bigger feature project than just adding bond orders.

There are bond orders in there, so it actually seems feasible to write a converter from https://files.wwpdb.org/pub/pdb/data/monomers/aa-variants-v1.cif to openmm/app/data/residues.xml

@aizvorski
Copy link
Contributor
aizvorski commented Oct 20, 2022

Your file uses nonstandard naming

Hmm, it seems so, even though it comes straight from Pymol (Pymol bug?). I ran it through pdbfixer which added extra hydrogens, ie both H02 and H2. Manually fixing the names in glygly.pdb works though.

The error message is wrong/confusing: "The set of atoms matches NGLY, but the bonds are different" - that doesn't say anything about nonstandard atom names, or which atoms have the nonstandard names, or how to fix the problem. Why does the code think the bonds are different, as opposed to atom names? Which bonds?

@peastman
Copy link
Member

Fully supporting all of those seems like a tall order, and a much bigger feature project than just adding bond orders.

At least initially, I think it's sufficient to only support the variants used by addHydrogens().

Why does the code thing the bonds are different, as opposed to atom names? Which bonds?

See https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#template for an explanation of what's going on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants
0