3D SDF generation from csv file · Issue #2453 · rdkit/rdkit · GitHub
3D SDF generation from csv file #2453

Closed
sbhakat opened this issue May 17, 2019 · 15 comments
@sbhakat
sbhakat commented May 17, 2019

I am new to RDKit and am trying to generate an SDF file from a .csv file containing more than 1000 SMILES. The .csv file looks like the following (each line is one SMILES string):

O1CC[C@@H](NC(=O)[C@@H](Cc2cc3cc(ccc3nc2N)-c2ccccc2C)C)CC1(C)C
Fc1cc(cc(F)c1)C[C@H](NC(=O)[C@@H](N1CC[C@](NC(=O)C)(CC(C)C)C1=O)CCc1ccccc1)[C@H](O)[C@@H]1[NH2+]C[C@H](OCCC)C1
S1(=O)(=O)N(c2cc(cc3c2n(cc3CC)CC1)C(=O)N[C@H]([C@H](O)C[NH2+]Cc1cc(OC)ccc1)Cc1ccccc1)C
S1(=O)(=O)C[C@@H](Cc2cc(O[C@H](COCC)C(F)(F)F)c(N)c(F)c2)[C@H](O)[C@@H]([NH2+]Cc2cc(ccc2)C(C)(C)C)C1
S1(=O)(=O)N(c2cc(cc3c2n(cc3CC)CC1)C(=O)N[C@H]([C@H](O)C[NH2+]Cc1cc(ccc1)C(F)(F)F)Cc1ccccc1)C

I generated the 2D SDF file using the following script

import pandas as pd
from rdkit.Chem import PandasTools
pp = pd.read_csv('only_smile.csv', names=['Smiles'])                 
PandasTools.AddMoleculeColumnToFrame(pp,'Smiles')           
PandasTools.WriteSDF(pp, 'pp_out.sdf')

I want to add hydrogens and generate a 3D SDF file for the same molecules. Any help?

@lilleswing
Contributor
pp['Smiles'] = [Chem.AddHs(x) for x in pp['Smiles'].values.tolist()]
PandasTools.WriteSDF(pp, 'pp_out.sdf')

@sbhakat
Author
sbhakat commented May 18, 2019

@lilleswing I got the following error:

>>> from rdkit import Chem
>>> pp['Smiles'] = [Chem.AddHs(x) for x in pp['Smiles'].values.tolist()]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
Boost.Python.ArgumentError: Python argument types in
    rdkit.Chem.rdmolops.AddHs(str)
did not match C++ signature:
    AddHs(RDKit::ROMol mol, bool explicitOnly=False, bool addCoords=False, boost::python::api::object  bool addResidueInfo=False)

any help?
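
For context, the traceback says that `Chem.AddHs` received a `str`: the `Smiles` column still holds plain text, while `AddHs` expects an RDKit molecule object. A minimal sketch of the missing conversion step, using ethanol as a stand-in SMILES (it is not taken from the CSV in question):

```python
from rdkit import Chem

smiles = "CCO"  # stand-in example, not taken from only_smile.csv

# Parse the text into a molecule first; MolFromSmiles returns None on failure.
mol = Chem.MolFromSmiles(smiles)
if mol is None:
    raise ValueError(f"could not parse SMILES: {smiles!r}")

# AddHs accepts the molecule object and returns a new one with explicit hydrogens.
mol_h = Chem.AddHs(mol)
print(mol.GetNumAtoms(), mol_h.GetNumAtoms())  # 3 heavy atoms, 9 atoms with hydrogens
```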

@CKannas
Contributor
CKannas commented May 18, 2019 via email

@sbhakat
Author
sbhakat commented May 20, 2019

@CKannas I got the following error:

>>> pp['ROMol'] = [Chem.AddHs(x) for x in pp['ROMol'].values.tolist()]
Traceback (most recent call last):
  File "/home/sbhakat/miniconda2/envs/rdkit/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'ROMol'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sbhakat/miniconda2/envs/rdkit/lib/python3.6/site-packages/pandas/core/frame.py", line 2688, in __getitem__
    return self._getitem_column(key)
  File "/home/sbhakat/miniconda2/envs/rdkit/lib/python3.6/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
    return self._get_item_cache(key)
  File "/home/sbhakat/miniconda2/envs/rdkit/lib/python3.6/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
    values = self._data.get(item)
  File "/home/sbhakat/miniconda2/envs/rdkit/lib/python3.6/site-packages/pandas/core/internals.py", line 4115, in get
    loc = self.items.get_loc(item)
  File "/home/sbhakat/miniconda2/envs/rdkit/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'ROMol'

Any help?

@CKannas
Contributor
CKannas commented May 21, 2019 via email

@CKannas
Contributor
CKannas commented May 21, 2019 via email

@sbhakat
Author
sbhakat commented May 21, 2019

Here is the full script which I am trying to execute

import pandas as pd
from rdkit.Chem import PandasTools
from rdkit import Chem
from rdkit import RDConfig
pp = pd.read_csv('only_smile.csv', names=['Smiles']) 
pp['ROMol'] = [Chem.AddHs(x) for x in pp['ROMol'].values.tolist()]                
PandasTools.AddMoleculeColumnToFrame(pp,'Smiles')           
PandasTools.WriteSDF(pp, 'pp_out.sdf')

The only_smile.csv file contains one SMILES string per line, as shown earlier.
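
A note on the `KeyError: 'ROMol'`: in the script above, `pp['ROMol']` is read one line before `AddMoleculeColumnToFrame` runs, and that call is what creates the `ROMol` column. A minimal sketch of the reordered steps, assuming a small in-memory frame as a stand-in for the CSV:

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import PandasTools

# Stand-in for pd.read_csv('only_smile.csv', names=['Smiles'])
pp = pd.DataFrame({"Smiles": ["CCO", "c1ccccc1"]})

# 1. Parse the SMILES; this call is what creates the 'ROMol' column.
PandasTools.AddMoleculeColumnToFrame(pp, "Smiles")

# 2. Only now does 'ROMol' exist, so hydrogens can be added to it.
#    (A SMILES that fails to parse leaves None here and would need filtering first.)
pp["ROMol"] = [Chem.AddHs(m) for m in pp["ROMol"]]

print([m.GetNumAtoms() for m in pp["ROMol"]])  # [9, 12]
```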

@CKannas
Contributor
CKannas commented May 21, 2019 via email

@sbhakat
Author
sbhakat commented May 22, 2019

Here is my attempt:


>>> import pandas as pd
>>> import rdkit
>>> from rdkit import Chem
>>> from rdkit.Chem import AllChem
>>> from rdkit.Chem import PandasTools
>>> 
>>> print(rdkit.rdBase.rdkitVersion)
2018.09.1
>>> smiles_file = "./only_smile.csv"
>>> world_drugs = pd.read_csv(smiles_file)
>>> print(len(world_drugs))
1512
>>> 
>>> 
>>> world_drugs.head(2)
                                                   O1CC[C@@H](NC(=O)[C@@H](Cc2cc3cc(ccc3nc2N)-c2ccccc2C)C)CC1(C)C
0  Fc1cc(cc(F)c1)C[C@H](NC(=O)[C@@H](N1CC[C@](NC(=O)C)(CC(C)C)C1=O)CCc1ccccc1)[C@H](O)[C@@H]1[NH2+]C[C@H](OCCC)C1
1                          S1(=O)(=O)N(c2cc(cc3c2n(cc3CC)CC1)C(=O)N[C@H]([C@H](O)C[NH2+]Cc1cc(OC)ccc1)Cc1ccccc1)C
>>> 
>>> PandasTools.AddMoleculeColumnToFrame(world_drugs, "smiles")

I got the following error:

Traceback (most recent call last):
  File "/home/sbhakat/miniconda2/envs/rdkit/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Smiles'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sbhakat/miniconda2/envs/rdkit/lib/python3.6/site-packages/rdkit/Chem/PandasTools.py", line 299, in AddMoleculeColumnToFrame
    frame[molCol] = frame[smilesCol].map(Chem.MolFromSmiles)
  File "/home/sbhakat/miniconda2/envs/rdkit/lib/python3.6/site-packages/pandas/core/frame.py", line 2688, in __getitem__
    return self._getitem_column(key)
  File "/home/sbhakat/miniconda2/envs/rdkit/lib/python3.6/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
    return self._get_item_cache(key)
  File "/home/sbhakat/miniconda2/envs/rdkit/lib/python3.6/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
    values = self._data.get(item)
  File "/home/sbhakat/miniconda2/envs/rdkit/lib/python3.6/site-packages/pandas/core/internals.py", line 4115, in get
    loc = self.items.get_loc(item)
  File "/home/sbhakat/miniconda2/envs/rdkit/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Smiles'

My .csv file has no header or anything else; it looks like the following:

O1CC[C@@H](NC(=O)[C@@H](Cc2cc3cc(ccc3nc2N)-c2ccccc2C)C)CC1(C)C
Fc1cc(cc(F)c1)C[C@H](NC(=O)[C@@H](N1CC[C@](NC(=O)C)(CC(C)C)C1=O)CCc1ccccc1)[C@H](O)[C@@H]1[NH2+]C[C@H](OCCC)C1
S1(=O)(=O)N(c2cc(cc3c2n(cc3CC)CC1)C(=O)N[C@H]([C@H](O)C[NH2+]Cc1cc(OC)ccc1)Cc1ccccc1)C
S1(=O)(=O)C[C@@H](Cc2cc(O[C@H](COCC)C(F)(F)F)c(N)c(F)c2)[C@H](O)[C@@H]([NH2+]Cc2cc(ccc2)C(C)(C)C)C1
S1(=O)(=O)N(c2cc(cc3c2n(cc3CC)CC1)C(=O)N[C@H]([C@H](O)C[NH2+]Cc1cc(ccc1)C(F)(F)F)Cc1ccccc1)C
S1(=O)C[C@@H](Cc2cc(OC(C(F)(F)F)C(F)(F)F)c(N)c(F)c2)[C@H](O)[C@@H]([NH2+]Cc2cc(ccc2)C(C)(C)C)C1
S(=O)(=O)(CCCCC)C[C@@H](NC(=O)c1cccnc1)C(=O)N[C@H]([C@H](O)C[NH2+]Cc1cc(ccc1)CC)Cc1cc(F)cc(F)c1
Fc1c2c(ccc1)[C@@]([NH+]=C2N)(C=1C=C(C)C(=O)N(C=1)CC)c1cc(ccc1)-c1cc(cnc1)C#CC
O1c2c(cc(cc2)CC)[C@@H]([NH2+]C[C@@H](O)[C@H]2NC(=O)C=3C=CC(=O)N(CCCCc4cc(C2)ccc4)C=3)CC12CCC2
O=C1N(CCCC1)C(C)(C)[C@@H]1C[C@@H](CCC1)C(=O)N[C@H]([C@H](O)C[NH2+]Cc1cc(ccc1)C(C)C)Cc1ccccc1
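
A likely cause, judging from the `head(2)` output above: `pd.read_csv` was called without `names`, so pandas took the first SMILES as the column header, and no column called `Smiles` (or `smiles`; the lookup is case-sensitive) exists. The sketch below illustrates the difference on two stand-in rows; the data is illustrative only.

```python
import io
import pandas as pd

# Two stand-in rows mimicking a header-less, one-column file.
csv_text = "CCO\nc1ccccc1\n"

# Default: the first row is consumed as the header, so one molecule is lost
# and the column is named after that SMILES string.
default_df = pd.read_csv(io.StringIO(csv_text))
print(list(default_df.columns), len(default_df))  # ['CCO'] 1

# header=None with names=[...] keeps every row and gives the column a stable name.
named_df = pd.read_csv(io.StringIO(csv_text), header=None, names=["Smiles"])
print(list(named_df.columns), len(named_df))  # ['Smiles'] 2
```

The column name passed to `AddMoleculeColumnToFrame` then has to match the one given in `names` exactly.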

@pgg1610
pgg1610 commented Jul 31, 2020

One way to solve this is to name the SMILES column explicitly when reading the CSV, then pass that same name to PandasTools:

import pandas as pd
from rdkit import Chem
from rdkit.Chem import PandasTools
df = pd.read_csv(PATH_CSV, names=['SMILES'])  # add more column names here as per your csv
PandasTools.AddMoleculeColumnToFrame(df, 'SMILES')  # adds the ROMol column with the parsed molecules
df['Mol_H'] = df['ROMol'].apply(Chem.AddHs)  # copies of the molecules with explicit hydrogens
PandasTools.WriteSDF(df, '{}/SDF_File.sdf'.format(source_csv_dir), molColName='Mol_H', idName=None, properties=['SMILES'], allNumeric=False)  # write Mol_H, not the default ROMol
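
The snippet above adds hydrogens, but the molecules still carry no 3D coordinates. Below is a minimal sketch of the remaining step, assuming RDKit's `AllChem.EmbedMolecule` is an acceptable way to generate them. The helper name `embed_3d`, the random seed, the `Mol_3D` column, and the output path are illustrative choices, not part of the original answer.

```python
import os
import tempfile

import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem, PandasTools


def embed_3d(mol, seed=42):
    """Return a copy of mol with hydrogens and one 3D conformer, or None on failure."""
    if mol is None:
        return None
    mol_h = Chem.AddHs(mol)
    # EmbedMolecule modifies mol_h in place and returns -1 if embedding fails.
    if AllChem.EmbedMolecule(mol_h, randomSeed=seed) == -1:
        return None
    return mol_h


df = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1"]})  # stand-in for the CSV contents
PandasTools.AddMoleculeColumnToFrame(df, "SMILES")
df["Mol_3D"] = df["ROMol"].apply(embed_3d)

# Drop rows that failed, then write the 3D column rather than the default 'ROMol'.
ok = df[df["Mol_3D"].notnull()]
out_path = os.path.join(tempfile.mkdtemp(), "molecules_3d.sdf")  # illustrative path
PandasTools.WriteSDF(ok, out_path, molColName="Mol_3D", properties=["SMILES"])
```

If relaxed geometries are needed, a force-field cleanup such as `AllChem.MMFFOptimizeMolecule` can follow the embedding.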

@mainguyenanhvu

@pgg1610 thank you for your solution. I have a follow-up question: how can this process be sped up (or parallelized)? I have a very large number of SMILES.
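
One possible approach, sketched here under the assumption that 3D embedding is the slow step: run it in worker processes and pass plain text (SMILES in, mol block out) between processes. The function names, worker count, and chunk size are hypothetical, and this is not an RDKit-provided utility.

```python
from concurrent.futures import ProcessPoolExecutor

from rdkit import Chem
from rdkit.Chem import AllChem


def smiles_to_molblock(smiles, seed=42):
    """Worker: SMILES in, 3D mol block text out (None if parsing or embedding fails)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol_h = Chem.AddHs(mol)
    if AllChem.EmbedMolecule(mol_h, randomSeed=seed) == -1:
        return None
    mol_h.SetProp("_Name", smiles)  # use the SMILES as the record title
    return Chem.MolToMolBlock(mol_h)


def write_sdf_parallel(smiles_list, out_path, n_workers=4, chunksize=50):
    """Embed molecules in worker processes and write the results to one SDF file."""
    n_written = 0
    with ProcessPoolExecutor(max_workers=n_workers) as pool, open(out_path, "w") as out:
        # pool.map preserves input order; chunksize batches the work sent to each process.
        for block in pool.map(smiles_to_molblock, smiles_list, chunksize=chunksize):
            if block is not None:
                out.write(block + "$$$$\n")  # "$$$$" terminates each SDF record
                n_written += 1
    return n_written
```

In a script, `write_sdf_parallel(df["SMILES"].tolist(), "out.sdf")` would be called under an `if __name__ == "__main__":` guard, which process-based pools require on platforms that spawn new interpreters.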

@mainguyenanhvu

I don't know why the following error occurred when I re-ran the script:

PandasTools.WriteSDF(pp, args.output_file, molColName='ID', properties=list(pp.columns))
  File "/scratch/micromamba/envs/biotools_py39/lib/python3.9/site-packages/rdkit/Chem/PandasTools.py", line 440, in WriteSDF
    mol = Chem.Mol(row[1][molColName])
RuntimeError: Bad pickle format: unexpected End-of-File while reading

I updated to pandas == 2.0.0 as suggested here, but the error persists.

Please help me to solve it.

My code here:

import pandas as pd
from pprint import pprint
from rdkit.Chem import PandasTools
from rdkit import Chem
from rdkit.Chem import AllChem

pp = pd.read_csv(args.input_file)
PandasTools.AddMoleculeColumnToFrame(pp,'smiles')
pp["Mol_H"] = pp["ROMol"].apply(Chem.AddHs)
pp["Mol_H"].map(AllChem.EmbedMolecule)
pprint(pp)
PandasTools.WriteSDF(pp, args.output_file, molColName='ID', properties=list(pp.columns))

Another error:

PandasTools.WriteSDF(pp, args.output_file, molColName='ID', properties=list(pp.columns))
  File "/scratch/micromamba/envs/biotools_py39/lib/python3.9/site-packages/rdkit/Chem/PandasTools.py", line 440, in WriteSDF
    mol = Chem.Mol(row[1][molColName])
RuntimeError: Bad pickle format: bad endian ID or invalid file format

@mainguyenanhvu

Change to PandasTools.WriteSDF(df, args.output_file, idName='AQID', properties=list(df.columns)) to fix this error.
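
A possible explanation for the "Bad pickle format" errors: per the traceback, `WriteSDF` passes each cell of the `molColName` column to `Chem.Mol(...)`, so that column has to hold molecules. With `molColName='ID'` it receives identifiers instead. An identifier column belongs in `idName`. A sketch with a made-up two-row frame (the `ID` values and output path are hypothetical):

```python
import os
import tempfile

import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem, PandasTools

pp = pd.DataFrame({"ID": ["mol_a", "mol_b"], "smiles": ["CCO", "CC(=O)O"]})  # made-up data
PandasTools.AddMoleculeColumnToFrame(pp, "smiles")
pp["Mol_H"] = pp["ROMol"].apply(Chem.AddHs)

# EmbedMolecule adds the conformer in place; the returned Series holds
# conformer ids, with -1 marking a failed embedding.
conf_ids = pp["Mol_H"].map(lambda m: AllChem.EmbedMolecule(m, randomSeed=42))

out_path = os.path.join(tempfile.mkdtemp(), "out.sdf")
# molColName: column of molecules to write; idName: column used for each record's title line.
PandasTools.WriteSDF(pp, out_path, molColName="Mol_H", idName="ID", properties=["smiles"])
```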

Contributor
github-actions bot commented Dec 5, 2024

This issue was marked as stale because it has been open for 90 days with no activity.

@github-actions github-actions bot added the stale label Dec 5, 2024
Contributor

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as not planned Dec 20, 2024