SMILES parsing in CDK and OpenEye but not RDKit #8119
Comments
Hi @schymane, thanks for pointing these out. I'm going to mark this as "not a bug" because I think that the RDKit is behaving as expected. There's some more information about this below. Hopefully this will be useful to you as you reply to the reviewer.

The first thing: all of these SMILES are syntactically correct and the RDKit can parse them without problems:
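A minimal sketch of that kind of check, assuming the Python RDKit package is installed. It parses a few of the reported SMILES with sanitization switched off; the helper name `parse_without_sanitization` is made up for this illustration and is not the code from the original comment.

```python
from rdkit import Chem

# A few of the SMILES from the original report.
smiles_list = [
    "O=Cl(=O)(=O)F",
    "FBr(F)F",
    "FCl(F)(F)(F)F",
    "FI(F)(F)(F)(F)(F)F",
]


def parse_without_sanitization(smiles):
    """Parse a SMILES string with RDKit's sanitization step turned off.

    Returns None only if the SMILES itself cannot be parsed.
    """
    return Chem.MolFromSmiles(smiles, sanitize=False)


mols = {smi: parse_without_sanitization(smi) for smi in smiles_list}
for smi, mol in mols.items():
    status = "parsed" if mol is not None else "not parsed"
    n_atoms = mol.GetNumAtoms() if mol is not None else 0
    print(f"{smi}: {status}, {n_atoms} atoms")
```

A molecule object coming back here only says the SMILES is syntactically valid; no chemistry checks have been run on it.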
What I have done above is turned off the RDKit's sanitization: a step which is carried out by default to try and make sure that a molecule is chemically reasonable. All of these molecules fail in sanitization because they have (at least) one atom with a valence that the RDKit does not like. There's information on how this calculation is done (and sanitization in general) here: https://rdkit.org/docs/RDKit_Book.html#valence-calculation-and-allowed-valences

The reason the RDKit does this by default is to try and prevent incorrect or badly drawn structures from accidentally being used in calculations. One consequence of the way the calculation is done is that many hypervalent species, like

One of your example molecules is actually a nice demonstration of why the RDKit behaves the way it does:

In cases where you don't care about chemical correctness or want to support hypervalent species, it's always possible to disable individual steps of the sanitization process in order to allow the questionable structures through. There's an example in the cookbook showing how to disable the valence checks: https://www.rdkit.org/docs/Cookbook.html#explicit-valence-error-partial-sanitization
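For reference, a minimal sketch of partial sanitization, modelled on the linked Cookbook entry and assuming the Python RDKit package is installed. The function name `partially_sanitize` is hypothetical; the set of sanitization operations is the one the Cookbook example uses, which leaves out the valence (properties) check.

```python
from rdkit import Chem

# Sanitization steps to run; the valence check (SANITIZE_PROPERTIES) is left out.
PARTIAL_OPS = (
    Chem.SanitizeFlags.SANITIZE_FINDRADICALS
    | Chem.SanitizeFlags.SANITIZE_KEKULIZE
    | Chem.SanitizeFlags.SANITIZE_SETAROMATICITY
    | Chem.SanitizeFlags.SANITIZE_SETCONJUGATION
    | Chem.SanitizeFlags.SANITIZE_SETHYBRIDIZATION
    | Chem.SanitizeFlags.SANITIZE_SYMMRINGS
)


def partially_sanitize(smiles):
    """Parse a SMILES and sanitize it without the valence check.

    Returns (mol, failed_op). failed_op is SANITIZE_NONE when every
    requested step succeeded. mol is None if the SMILES cannot be parsed.
    """
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        return None, None
    # Compute valence information without raising on "bad" valences.
    mol.UpdatePropertyCache(strict=False)
    failed_op = Chem.SanitizeMol(mol, PARTIAL_OPS, catchErrors=True)
    return mol, failed_op


mol, failed_op = partially_sanitize("FCl(F)F")
print(mol.GetNumAtoms(), failed_op)
```

Molecules prepared this way can be depicted, but any descriptor or calculation that relies on valences being chemically sensible should be treated with care.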
Thanks @greglandrum - I agree that this is behaving as expected (which is why we didn't raise an issue earlier) and is not a bug, that tag came "for free" when I posted the issue (sorry).

OK so we have a few different cases here. Firstly, thanks for the partial sanitization tips, @neoflex has already updated the PubChemLite-web code to fix the missing depictions for CIDs failing sanitization (2e8658bf). In the current version this affects 11 CIDs:
The low CIDs (first 6 listed above, 24258, 24594, 24606, 24637, 61654, 85645) have heaps of annotation in PubChem and don't appear to be deposition artifacts. The "partial sanitization" approach should help deal with these better (thanks!).

CID 167661 only has a small amount of annotation (from ECHA, 1 notification, 37 reports) but is in CAS Common Chemistry, CompTox etc., so seems OK.

CID 3470663 has no annotation in its own right but maps to two related species (CIDs 10992990 and 13620934) with a small amount of annotation, again from ECHA. Each has only 1 notification (38 reports). @PaulThiessen we may need to check if there are any issues in the ECHA data; there's a common pattern here and this has been expanding a lot recently (adding noise?). I've added a title to 3470663 for Leon to grab to fix the

CID 44629788 again has no annotation in its own right, and only one related CID with annotation, 44629787, again from ECHA: 1 notification, 1 report (embedded URLs direct to the annotation). There's definitely a pattern here. We will look into refining the "Safety and Hazards" category in the new year (eci/pclbuild/-/issues/2).

For CID 53471611, indeed this looks like an artifact that has propagated, as you point out. It's in PubChemLite due to Use and Manufacturing annotation from the NORMAN-SLE List S25, and it's also in list S89. I have implemented a fix for these lists (50f0023b should fix S25, e39585ec should fix S89, Jeff has also processed these) to remove this structure and annotation from the NORMAN-SLE, which will remove it from PubChemLite in due course once that annotation is gone.

Finally, CID 57345757 again has no annotation in its own right and displays with the name "Test 1" in PubChemLite(!!), but has one related CID, 57345756, which also has the informative synonym TEST1. This has one piece of annotation from the EU Clinical Trials register. There seems to be only one substance: 136367740. The CAS number in that deposition points to something else (see e.g. 50-41-9 or 50-41-9). Leon will block the "test1" synonym and the annotation should be associated with Amantadine instead (https://www.clinicaltrialsregister.eu/ctr-search/trial/2021-004868-95/ES), which should remove this from PubChemLite in due course.
Hi all,
We've been asked by a reviewer to post an issue reporting this behaviour, in the hope it can be resolved in future releases.
We have a set of 12 CIDs from PubChem that are processed with OpenEye (PubChem, source of the SMILES) and CDK (MetFrag) but fail in c3sdb, which uses RDKit (requirements). Below the table is a screenshot of how they appear in CDK Depict.
O=Cl(=O)(=O)F
FBr(F)F
FBr(F)(F)(F)F
FCl(F)F
FCl(F)(F)(F)F
FI(F)(F)(F)(F)(F)F
O=ClCl(=O)=O
C1=CC=C2C(=C1)O[Si-2]34(O2)(OC5=CC=CC=C5O3)OC6=CC=CC=C6O4
C1(=O)C(=O)O[Ge-2]23(O1)(OC(=O)C(=O)O2)OC(=O)C(=O)O3
CC1=CC(=[N+](C(=C1)C)[Br-][N+]2=C(C=C(C=C2C)C)C)C
C1(=C(C(=C(C(=C1F)F)[Si](Cl)(Cl)Cl(C(C(C(C(C(C(C(C(F)(F)F)(F)F)(F)F)(F)F)(F)F)(F)F)(F)F)(F)F)F)F)F)F
CNCCCCOC1CCCCC1CCC2CCCCC2ClC3CCCCC3
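For anyone wanting to reproduce this, here is a minimal sketch (assuming the Python RDKit package is installed) that asks RDKit which problems it sees in a SMILES, using `Chem.DetectChemistryProblems`. The helper name `chemistry_problems` is made up for this illustration, and the exact messages depend on the RDKit version in use.

```python
from rdkit import Chem, RDLogger

# Silence RDKit's own error log; problems are reported via the return value.
RDLogger.DisableLog("rdApp.*")

# Three of the SMILES listed above.
reported_smiles = [
    "FCl(F)F",
    "O=ClCl(=O)=O",
    "CNCCCCOC1CCCCC1CCC2CCCCC2ClC3CCCCC3",
]


def chemistry_problems(smiles):
    """Return (problem type, message) pairs that RDKit finds for a SMILES."""
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        raise ValueError(f"SMILES could not be parsed: {smiles!r}")
    return [(p.GetType(), p.Message()) for p in Chem.DetectChemistryProblems(mol)]


for smi in reported_smiles:
    problems = chemistry_problems(smi)
    print(smi)
    for kind, message in problems:
        print(f"  {kind}: {message}")
    if not problems:
        print("  no problems reported by this RDKit version")
```

An empty list means sanitization would pass for that structure in the installed RDKit version.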
We do not know the identity of the reviewer but if you need further information I am also tagging @PaulThiessen (PubChem) and @dylanhross (c3sdb). The additional context is in this preprint: DOI:10.26434/chemrxiv-2024-2xcsq. Thanks!