[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The same bit from GetBitInfoMap() seems to point to different molecular fragments #7003

Open
GLPG-GT opened this issue Dec 20, 2023 · 4 comments

Comments

@GLPG-GT
Copy link
GLPG-GT commented Dec 20, 2023

Loading modules:

from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from rdkit import DataStructs
import rdkit
from rdkit.Chem import Draw

Initialisation of the fingerprint generator:

mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius = 2)
ao = rdFingerprintGenerator.AdditionalOutput()
ao.AllocateBitInfoMap()

First molecule:

input_smiles = 'c1ccc2nnn2cc1'
features_dict_bits = dict()
features_dict_onbits = dict()
m = Chem.MolFromSmiles(input_smiles)
sfp = mfpgen.GetSparseFingerprint(m, additionalOutput = ao)
bitinfomap = ao.GetBitInfoMap()
bits = list(bitinfomap.keys())

for b, ob in zip(bits, onbits) :    
    if b not in features_dict_bits :
        tu = bitinfomap[b]
        tu0 = tu[0]
        a = tu0[0]
        r = tu0[1]
        env = Chem.FindAtomEnvironmentOfRadiusN(m, r, a)
        atoms = set()
        if r > 0 :
            for bidx in env :
                atoms.add(m.GetBondWithIdx(bidx).GetBeginAtomIdx())
                atoms.add(m.GetBondWithIdx(bidx).GetEndAtomIdx())
            usm = Chem.MolFragmentToSmiles(m, atomsToUse = list(atoms), bondsToUse = env, rootedAtAtom = a)
        else :
            atoms.add(a)
            usm = Chem.MolFragmentToSmiles(m, atomsToUse = list(atoms), bondsToUse = [], rootedAtAtom = a)
        features_dict_bits[b] = usm
        features_dict_onbits[ob] = usm        
features_dict_bits

{98513984: 'c(c)c',
 222023415: 'n(n)n',
 571978829: 'c(c)n',
 951226070: 'c(c)c',
 1375037798: 'c1(cc)nnn-1c',
 2041434490: 'n',
 2092489639: 'n',
 2207938650: 'c(cc)n(-c)n',
 2284773903: 'c(c)(n)-n',
 2290389376: 'c(cc)cn',
 2309124962: 'n1nc-n1c',
 2433800525: 'n1(cc)nnc-1c',
 2685705257: 'n1nn-c1c',
 2763854213: 'c(cc)cc',
 3053370223: 'c(cc)c(n)-n',
 3217380708: 'c',
 3218693969: 'c',
 3458774156: 'n(c)n',
 3999906991: 'c(cc)cc',
 4210054575: 'n(-c)(c)n'}
Draw.MolsToGridImage([Chem.MolFromSmarts(sm) for sm in list(features_dict_bits.values())],
                      molsPerRow = 5,
                      subImgSize = (200, 100),
                     legends = [str(bn) for bn in list(features_dict_bits.keys())]
                    )

download

Second molecule:

input_smiles = 'Cn1cc(nn1)N2Cc3ncncc3C2=O'
features_dict_bits = dict()
features_dict_onbits = dict()
m = Chem.MolFromSmiles(input_smiles)
sfp = mfpgen.GetSparseFingerprint(m, additionalOutput = ao)
bitinfomap = ao.GetBitInfoMap()
bits = list(bitinfomap.keys())

for b, ob in zip(bits, onbits) :    
    if b not in features_dict_bits :
        tu = bitinfomap[b]
        tu0 = tu[0]
        a = tu0[0]
        r = tu0[1]
        env = Chem.FindAtomEnvironmentOfRadiusN(m, r, a)
        atoms = set()
        if r > 0 :
            for bidx in env :
                atoms.add(m.GetBondWithIdx(bidx).GetBeginAtomIdx())
                atoms.add(m.GetBondWithIdx(bidx).GetEndAtomIdx())
            usm = Chem.MolFragmentToSmiles(m, atomsToUse = list(atoms), bondsToUse = env, rootedAtAtom = a)
        else :
            atoms.add(a)
            usm = Chem.MolFragmentToSmiles(m, atomsToUse = list(atoms), bondsToUse = [], rootedAtAtom = a)
        features_dict_bits[b] = usm
        features_dict_onbits[ob] = usm        
features_dict_bits

{10565946: 'O=C',
 202930282: 'n(C)(cc)nn',
 222023415: 'n(n)n',
 422274696: 'c(c(n)N)n(C)n',
 572760528: 'c(c)n',
 662511025: 'N(c)(C)C',
 725322217: 'c(n)n',
 864942730: 'O',
 919827735: 'c(nc)c(C)c',
 1042419715: 'c(cn)(nn)N(C)C',
 1076610048: 'C(c(c)n)N(c)C',
 1100037548: 'n(c)c',
 1178619885: 'c(nc)nc',
 1255626466: 'n(c)(C)n',
 1299099804: 'N(Cc)(C(c)=O)c(c)n',
 1718532258: 'c(CN)(nc)c(C)c',
 1727438911: 'n(cn)c(c)C',
 1777981749: 'c(cn)(C(N)=O)c(C)n',
 1871221615: 'C(c)(N)=O',
 2041434490: 'n',
 2092489639: 'n',
 2246728737: 'C',
 2284773903: 'c(c)(n)N',
 2513428378: 'C(c)N',
 2685705257: 'n(nn)c(c)N',
 2968968094: 'C',
 3034212252: 'c(c)(C)n',
 3118255683: 'n(c)c',
 3165426352: 'n(nc)n(c)C',
 3217380708: 'c',
 3218693969: 'c',
 3229069614: 'n(cc)cn',
 3458774156: 'n(c)n',
 3657471097: 'Cn',
 3777168895: 'c(c)n',
 3832608534: 'C(=O)(c(c)c)N(c)C',
 3982076256: 'c(C)(c)c'}
Draw.MolsToGridImage([Chem.MolFromSmarts(sm) for sm in list(features_dict_bits.values())],
                      molsPerRow = 5,
                      subImgSize = (200, 100),
                     legends = [str(bn) for bn in list(features_dict_bits.keys())]
                    )

download

You may spot that feature bit 2685705257 is present in both dictionaries, but: 1) it maps to a different substructure, assuming that our code to generate the fragment is correct; 2) in the first molecule it is a 4-membered aromatic ring, in the second molecule it probably comes from some ring-opening of the triazole.

Do you have any explanation?
Is this a bit collision of the Morgan algorithm?
Do you know of or expect similar cases?

Thanks

@GLPG-GT GLPG-GT added the bug label Dec 20, 2023
@greglandrum
Copy link
Member

Hi @GLPG-GT : this is a bit collision at the main hash level. It's not particularly common, particularly when compared to collisions due to bit folding, but it is definitely something that can happen.

Here's another example we identified some years ago: #814

The type of hash function that we use to convert atom environments into bit IDs are never perfect and will always produce this type of collision. The hope is that they will be rare.

@greglandrum
Copy link
Member

I'm going to mark this as "won't fix" because I think this type of collision is more or less inevitable.

@greglandrum
Copy link
Member

As an aside: your fragment drawings would be more representative of what the fragment actually is if you use MolFromSmiles() with sanitization turned off instead of MolFromSmarts() (which interprets unspecified bonds as 'single or aromatic')

Copy link
Contributor

This issue was marked as stale because it has been open for 90 days with no activity.

@github-actions github-actions bot added the stale label Oct 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants