Data input and T006 Maximum common substructure #429

dgcovell · 2025-02-10T17:43:39Z

To begin, I would like to thank the Volkamer lab for this series of tutorials. Working through each has been rewarding and instructive.

I have focused on T006 Maximum common substructure and need further details. My first question concerns the input data
(../T005_compound_clustering/data/molecule_set_largest_cluster.sdf), which has been processed through the code appearing
under T000. Replacing this SDF file with my own is not straightforward.
sdf = str(HERE / "../T005_compound_clustering/data/molecule_set_largest_cluster.sdf")
supplier = Chem.ForwardSDMolSupplier(sdf)
mols = list(supplier)
print(f"Set with {len(mols)} molecules loaded.")
print(dir(mols))

Set with 145 molecules loaded.
['__add__', '__class__', '__class_getitem__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']

Replacing the input file with my own yields the same results for print(dir(mols)). However, the assignment mols = list(supplier), where
supplier = Chem.ForwardSDMolSupplier(sdf_new), cannot be completed. I trace Chem.ForwardSDMolSupplier back to T000, where the
file has been derived/processed from the ChEMBL library. Can you please indicate the steps for defining mols from an SDF file created with Open Babel?
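
For reference, a minimal sketch of reading an arbitrary SDF file into a list of molecules. Chem.ForwardSDMolSupplier is part of RDKit itself (rdkit.Chem), so it should not depend on how the file was produced. The file name my_molecules.sdf and the helper load_sdf are hypothetical, and the file is written here with RDKit only as a stand-in for one exported from Open Babel. The sketch assumes that records RDKit cannot parse come back as None and are the usual source of trouble with externally generated files.

```python
import os
import tempfile

from rdkit import Chem


def load_sdf(path):
    """Return all molecules RDKit can parse from an SDF file (hypothetical helper)."""
    supplier = Chem.ForwardSDMolSupplier(str(path))
    # Records that fail to parse are returned as None; drop them.
    return [mol for mol in supplier if mol is not None]


# Stand-in for an SDF file created elsewhere (e.g. with Open Babel).
sdf_path = os.path.join(tempfile.mkdtemp(), "my_molecules.sdf")
writer = Chem.SDWriter(sdf_path)
for smiles in ["CCO", "Oc1ccccc1", "CC(=O)Nc1ccc(O)cc1"]:
    writer.write(Chem.MolFromSmiles(smiles))
writer.close()

mols = load_sdf(sdf_path)
print(f"Set with {len(mols)} molecules loaded.")
```

If the list comes back shorter than expected, comparing its length with the number of records in the file shows how many entries RDKit rejected.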

My second question has to do with correcting the error described below.

# Add molecule column to data frame
PandasTools.AddMoleculeColumnToFrame(mol_df, "smiles")
mol_df.head(3)
Failed to patch pandas - unable to change molecule rendering

Blog suggestions for correcting this do not appear to work. Might you have a solution?
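
As a possible workaround, the molecule column can be built with plain pandas, which does not rely on PandasTools patching the data frame rendering. This is a sketch under the assumption that the message only concerns how molecules are displayed in the notebook; the example data frame and the column name ROMol are placeholders.

```python
import pandas as pd
from rdkit import Chem

# Placeholder data frame with a "smiles" column, as in the tutorial.
mol_df = pd.DataFrame({"smiles": ["CCO", "Oc1ccccc1", "CC(=O)Nc1ccc(O)cc1"]})

# Build the molecule column directly instead of calling
# PandasTools.AddMoleculeColumnToFrame(mol_df, "smiles").
mol_df["ROMol"] = mol_df["smiles"].apply(Chem.MolFromSmiles)

# The molecules are ordinary RDKit objects and can be used as usual.
atom_counts = [mol.GetNumAtoms() for mol in mol_df["ROMol"]]
print(atom_counts)
```

The column holds regular RDKit molecules, so downstream steps such as the MCS calculation should not need the patched rendering.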

My third question is whether steps are available for searching a large database using the SMARTS derived in T006. I found
https://github.com/rdkit/rdkit-tutorials/blob/master/notebooks/002_SMARTS_SubstructureMatching.ipynb
However, a tutorial that could be used to search a large database for MCSs as derived in T006 would be helpful.
I believe this functionality is embedded in T006, but I am not sure. As with the earlier question, searching from a generic SDF file would be needed.
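
The kind of search meant here can be sketched as follows: derive an MCS SMARTS with rdFMCS.FindMCS, then keep every database molecule that contains it. The query molecules and the small SMILES list standing in for the database are made up for illustration; for a real SDF file the loop would run over Chem.ForwardSDMolSupplier instead. This is a plain linear scan, not an indexed search, so it is only an assumption that it would be fast enough for a large library.

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS

# Query set whose maximum common substructure defines the search pattern.
query_smiles = ["Oc1ccccc1", "Cc1ccc(O)cc1", "CC(=O)Nc1ccc(O)cc1"]
query_mols = [Chem.MolFromSmiles(s) for s in query_smiles]

mcs = rdFMCS.FindMCS(query_mols)
pattern = Chem.MolFromSmarts(mcs.smartsString)
print(f"MCS with {mcs.numAtoms} atoms: {mcs.smartsString}")

# Placeholder "database"; replace with molecules read from an SDF file.
database_smiles = ["CCO", "Oc1ccc(Cl)cc1", "c1ccncc1", "COc1ccccc1"]

hits = []
for smiles in database_smiles:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        continue  # skip entries RDKit cannot parse
    if mol.HasSubstructMatch(pattern):
        hits.append(smiles)

print(hits)
```

Using mol.GetSubstructMatch(pattern) instead of HasSubstructMatch returns the matching atom indices, which can then be used for highlighting.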

Fourth, I have a few related questions specific to T006:
a. Is there a way to enlarge the images? Highlighting is difficult to see. Or maybe there is a way to enlarge only the highlighting or enhance the color?
b. Can you provide the code for converting between SMILES and SMARTS? I realize the world has gone to SMARTS. Having a SMARTS-to-SMILES conversion might help move me forward.
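
To make both points concrete, here is a sketch of what is being asked for, using RDKit calls outside of the T006 helper functions. It converts a SMILES to SMARTS and back with Chem.MolToSmarts and Chem.MolToSmiles, then draws a molecule on a larger canvas with a chosen highlight color via rdMolDraw2D. The molecules, canvas size, and color are arbitrary choices, and the SMILES written from a SMARTS pattern is not guaranteed to be a chemically complete structure.

```python
from rdkit import Chem
from rdkit.Chem.Draw import rdMolDraw2D

# (b) SMILES -> SMARTS -> SMILES
smarts = Chem.MolToSmarts(Chem.MolFromSmiles("Oc1ccccc1"))
pattern = Chem.MolFromSmarts(smarts)
smiles_from_smarts = Chem.MolToSmiles(pattern)
print(smarts)
print(smiles_from_smarts)

# (a) Larger image with a stronger highlight color
mol = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")
match = mol.GetSubstructMatch(pattern)  # atom indices of the matched substructure

width, height = 800, 600  # canvas size in pixels (arbitrary)
highlight_color = (1.0, 0.3, 0.3)  # RGB values between 0 and 1 (arbitrary)

drawer = rdMolDraw2D.MolDraw2DSVG(width, height)
rdMolDraw2D.PrepareAndDrawMolecule(
    drawer,
    mol,
    highlightAtoms=list(match),
    highlightAtomColors={idx: highlight_color for idx in match},
)
drawer.FinishDrawing()
svg = drawer.GetDrawingText()
```

In a notebook, the resulting string can be shown with IPython.display.SVG(svg).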

Thanks in advance for your help and for providing this tutorial series. Hopefully I can process these and any future ones. I see that an example of Machine Learning is listed, but it is not in the teachopencadd suite. If I upload its ipynb, should it work as well, or is there a need to include data files?

Regards,
David Covell, Ph.D.
NIH-NCI

sakhawathsumit added the enhancement (New feature or request) and question (Further information is requested) labels on Feb 23, 2025
