SMILES parsing in CDK and OpenEye but not RDKit #8119

Open
schymane opened this issue Dec 17, 2024 · 2 comments
@schymane

Hi all,

We've been asked by a reviewer to post an issue reporting this behaviour, in the hope it can be resolved in future releases.
We have a set of 12 CIDs from PubChem that are processed with OpenEye (PubChem, source of the SMILES) and CDK (MetFrag) but fail in c3sdb, which uses RDKit (requirements). Below the table is a screenshot of how they appear in CDK Depict.

| CID | Formula | SMILES |
| --- | --- | --- |
| 24258 | ClFO3 | `O=Cl(=O)(=O)F` |
| 24594 | BrF3 | `FBr(F)F` |
| 24606 | BrF5 | `FBr(F)(F)(F)F` |
| 24637 | ClF3 | `FCl(F)F` |
| 61654 | ClF5 | `FCl(F)(F)(F)F` |
| 85645 | F7I | `FI(F)(F)(F)(F)(F)F` |
| 167661 | Cl2O3 | `O=ClCl(=O)=O` |
| 3470663 | C18H12O6Si | `C1=CC=C2C(=C1)O[Si-2]34(O2)(OC5=CC=CC=C5O3)OC6=CC=CC=C6O4` |
| 16687967 | C6GeO12 | `C1(=O)C(=O)O[Ge-2]23(O1)(OC(=O)C(=O)O2)OC(=O)C(=O)O3` |
| 44629788 | C16H22BrN2 | `CC1=CC(=[N+](C(=C1)C)[Br-][N+]2=C(C=C(C=C2C)C)C)C` |
| 53471611 | C14Cl3F23Si | `C1(=C(C(=C(C(=C1F)F)[Si](Cl)(Cl)Cl(C(C(C(C(C(C(C(C(F)(F)F)(F)F)(F)F)(F)F)(F)F)(F)F)(F)F)(F)F)F)F)F)F` |
| 57345757 | C25H47ClNO | `CNCCCCOC1CCCCC1CCC2CCCCC2ClC3CCCCC3` |

[Image: CDK Depict rendering of the 12 structures]

We do not know the identity of the reviewer but if you need further information I am also tagging @PaulThiessen (PubChem) and @dylanhross (c3sdb). The additional context is in this preprint: DOI:10.26434/chemrxiv-2024-2xcsq. Thanks!

@schymane schymane added the bug label Dec 17, 2024
@greglandrum
Member

Hi @schymane, thanks for pointing these out. I'm going to mark this as "not a bug" because I think that the RDKit is behaving as expected. There's some more information about this below. Hopefully this will be useful to you as you reply to the reviewer.

The first thing: all of these SMILES are syntactically correct and the RDKit can parse them without problems:

>>> from rdkit import Chem
>>> smis = ['O=Cl(=O)(=O)F',
... 'FBr(F)F',
... 'FBr(F)(F)(F)F',
... 'FCl(F)F',
... 'FCl(F)(F)(F)F',
... 'FI(F)(F)(F)(F)(F)F',
... 'O=ClCl(=O)=O',
... 'C1=CC=C2C(=C1)O[Si-2]34(O2)(OC5=CC=CC=C5O3)OC6=CC=CC=C6O4',
... 'C1(=O)C(=O)O[Ge-2]23(O1)(OC(=O)C(=O)O2)OC(=O)C(=O)O3',
... 'CC1=CC(=[N+](C(=C1)C)[Br-][N+]2=C(C=C(C=C2C)C)C)C',
... # adjacent string literals are concatenated, so this long SMILES can be split
... 'C1(=C(C(=C(C(=C1F)F)[Si](Cl)(Cl)Cl(C(C(C(C(C(C(C(C(F)(F)F)(F)F)(F)F)(F)F)('
... 'F)F)(F)F)(F)F)(F)F)F)F)F)F',
... 'CNCCCCOC1CCCCC1CCC2CCCCC2ClC3CCCCC3']
>>> # sanitize=False: parse the SMILES but skip the chemistry checks
>>> ms = [Chem.MolFromSmiles(x, sanitize=False) for x in smis]

What I have done above is turn off the RDKit's sanitization, a step that is carried out by default to try to make sure that a molecule is chemically reasonable.
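To make the distinction between "parses" and "passes sanitization" concrete, here is a minimal sketch. The helper name `parse_status` and the deliberately broken string `'FBr(F'` are illustrative assumptions, not part of the original report; the exact outcome of the default sanitization may depend on the RDKit version installed.

```python
from rdkit import Chem, RDLogger

# Silence RDKit's error log so failed parses do not clutter the output.
RDLogger.DisableLog('rdApp.*')


def parse_status(smiles):
    """Return (parses_without_sanitization, passes_default_sanitization)."""
    raw = Chem.MolFromSmiles(smiles, sanitize=False)  # SMILES syntax only
    checked = Chem.MolFromSmiles(smiles)              # syntax + sanitization
    return raw is not None, checked is not None


# 'FBr(F' is a hypothetical malformed string (unclosed branch) used only to
# show what a genuine syntax error looks like.
for smi in ['CCO', 'FBr(F)F', 'FBr(F']:
    print(smi, parse_status(smi))
```

A `None` from the first call means the SMILES itself is malformed; a `None` only from the second call means the string is fine but the molecule was rejected during sanitization.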

All of these molecules fail in sanitization because they have (at least) one atom with a valence that the RDKit does not like. There's information on how this calculation is done (and sanitization in general) here: https://rdkit.org/docs/RDKit_Book.html#valence-calculation-and-allowed-valences

The reason the RDKit does this by default is to try and prevent incorrect or badly drawn structures from accidentally being used in calculations. One consequence of the way the calculation is done is that many hypervalent species, like FBr(F)F, are flagged as having valence problems. This isn't ideal, but these species have not seemed to be common enough to merit having a special case that accepts the perfectly reasonable BrF3 but rejects likely errors like CBrC. Users who regularly work with hypervalent species will probably want to use partial sanitization (see below).

One of your example molecules is actually a nice demonstration of why the RDKit behaves the way it does:
PubChem compound 53471611 is connected to substance 162909720. The depositors of that structure also provided a CAS number: 753025-21-7. If you search PubChem for that CAS number, you find several other substances (and linked compounds), all of which are chemically reasonable. Here's one example. I think it's very likely that substance 162909720 includes a drawing error. Unquestioningly accepting two-coordinate Cl allows mistakes like this to continue to propagate through our ecosystem (look at the number of external links provided for compound 53471611... it's horrible).

In cases where you don't care about chemical correctness or want to support hypervalent species, it's always possible to disable individual steps of the sanitization process in order to allow the questionable structures through. There's an example in the cookbook showing how to disable the valence checks: https://www.rdkit.org/docs/Cookbook.html#explicit-valence-error-partial-sanitization
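A minimal sketch of that partial sanitization, roughly following the cookbook entry linked above: parse without sanitization, compute valences non-strictly, then run a subset of sanitization steps that leaves out the strict valence check. The helper name `load_without_valence_check` and the constant `PARTIAL_OPS` are assumptions made for illustration.

```python
from rdkit import Chem

# Sanitization steps to keep. SANITIZE_PROPERTIES, the step that performs the
# strict valence check, is deliberately not included.
PARTIAL_OPS = (Chem.SanitizeFlags.SANITIZE_FINDRADICALS
               | Chem.SanitizeFlags.SANITIZE_KEKULIZE
               | Chem.SanitizeFlags.SANITIZE_SETAROMATICITY
               | Chem.SanitizeFlags.SANITIZE_SETCONJUGATION
               | Chem.SanitizeFlags.SANITIZE_SETHYBRIDIZATION
               | Chem.SanitizeFlags.SANITIZE_SYMMRINGS)


def load_without_valence_check(smiles):
    """Parse a SMILES and sanitize it partially, skipping valence validation."""
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        return None  # the SMILES string itself could not be parsed
    # Compute valence-related properties without raising on unusual valences.
    mol.UpdatePropertyCache(strict=False)
    # catchErrors=True makes a failing step report rather than raise.
    Chem.SanitizeMol(mol, PARTIAL_OPS, catchErrors=True)
    return mol


for smi in ['FBr(F)F', 'FI(F)(F)(F)(F)(F)F']:
    mol = load_without_valence_check(smi)
    print(smi, mol.GetNumAtoms(), Chem.MolToSmiles(mol))
```

The trade-off is the one described above: structures such as BrF3 get through, but so do drawing errors, so this is best reserved for inputs where hypervalent species are expected.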

@greglandrum greglandrum removed the bug label Dec 18, 2024
@schymane
Author
schymane commented Dec 18, 2024

Thanks @greglandrum - I agree that this is behaving as expected (which is why we didn't raise an issue earlier) and is not a bug; that tag came "for free" when I posted the issue (sorry).

OK so we have a few different cases here.

Firstly, thanks for the partial sanitization tips. @neoflex has already updated the PubChemLite-web code to fix the missing depictions for CIDs failing sanitization (2e8658bf). In the current version this affects 11 CIDs:

24258
24594
24606
24637
61654
85645
167661
3470663
44629788
53471611
57345757

The low CIDs (first 6 listed above, 24258, 24594, 24606, 24637, 61654, 85645) have heaps of annotation in PubChem and don't appear to be deposition artifacts. The "partial sanitization" approach should help deal with these better (thanks!).

CID 167661 only has a small amount of annotation (from ECHA, 1 notification, 37 reports) but is in CAS Common Chemistry, CompTox etc so seems OK.

CID 3470663 has no annotation in its own right but maps to two related species (CIDs 10992990 and 13620934) with a small amount of annotation, again from ECHA. Each has only 1 notification (38 reports). @PaulThiessen we may need to check if there are any issues in the ECHA data; there's a common pattern here and this has been expanding a lot recently (adding noise?). I've added a title to 3470663 for Leon to grab to fix the CID 3470663 title currently on display in PubChem and the missing name in PubChemLite-web (see e415e1e4).

CID 44629788 again has no annotation in its own right, and only one related CID with annotation 44629787 from ECHA again - 1 notification, 1 report (embedded URLs direct to the annotation). There's definitely a pattern here. We will look into refining the "Safety and Hazards" category in the new year (eci/pclbuild/-/issues/2).

For CID 53471611, this does indeed look like an artifact that has propagated, as you point out. It's in PubChemLite due to Use and Manufacturing annotation from the NORMAN-SLE List S25 and it's also in list S89. I have implemented a fix for these lists (50f0023b should fix S25, e39585ec should fix S89, Jeff has also processed these) to remove this structure and its annotation from the NORMAN-SLE, which will remove it from PubChemLite in due course once that annotation is gone.

Finally CID 57345757 again has no annotation in its own right and displays with the name "Test 1" in PubChemLite(!!), but has one related CID 57345756 which also has the informative synonym TEST1. This has one piece of annotation from the EU Clinical Trials register. There seems to be only one substance: 136367740. The CAS number in that deposition points to something else (see e.g. 50-41-9 or 50-41-9). Leon will block the "test1" synonym and the annotation should be associated with Amantadine instead (https://www.clinicaltrialsregister.eu/ctr-search/trial/2021-004868-95/ES), which should remove this from PubChemLite in due course.
