Fixing does not appear to work for inferring valence & formal charge states from molecules from some PDB files #231

Croydon-Brixton · 2024-09-03T02:23:36Z

Thank you for this nice library!

I'm have a question re fixing 'broken' Mols by inferring the correct valences and charges that I was hoping datamol could fix for me.

If I load NAP structures from examples in the pdb (e.g. 5ocm) and simply transfer over bond annotations and atoms (formal charge is not specified in this PDB, so I'm assuming 0 charge) I end up with a structure like this:

smi = "c1cc(c[n](c1)[C@H]2[C@@H]([C@@H]([C@H](O2)CO[P@@](=O)([O])O[P@](=O)(O)OC[C@@H]3[C@H]([C@H]([C@@H](O3)n4cnc5c4ncnc5N)OP(=O)(O)O)O)O)O)C(=O)N"

# The correct smiles would be:
smi_correct = "c1cc(c[n+](c1)[C@H]2[C@@H]([C@@H]([C@H](O2)CO[P@@](=O)([O-])O[P@](=O)(O)OC[C@@H]3[C@H]([C@H]([C@@H](O3)n4cnc5c4ncnc5N)OP(=O)(O)O)O)O)O)C(=O)N"

RDkit then fails to load this due to sanitization problems

mol = Chem.MolFromSmiles(smi)  # < fails
mol = Chem.MolFromSmiles(smi, sanitize=False)  # <works and produces the structure above, which is an invalid molecule

This molecule can be 'rescued' by assigning a positive charge to nitrogen number 4, but the datamol pipeline unfortunately fails to do this:

import datamol as dm

# Standardize and sanitize
mol = Chem.MolFromSmiles(smi, sanitize=False)
mol = dm.fix_mol(mol)
mol = dm.sanitize_mol(mol)
mol = dm.standardize_mol(mol)
Chem.SanitizeMol(mol)

Is there a way to fix this structure computationally with datamol?

The text was updated successfully, but these errors were encountered:

Croydon-Brixton · 2024-09-03T02:24:17Z

For reference this would be the corrected structure:

maclandrol · 2024-09-03T13:55:07Z

Hey @Croydon-Brixton, thanks for using datamol.

Your specific question is a bit tricky. There are 2 reasons why the "fixing" fail on your molecule:

RDKit Validation says the molecule is fundamentally fine, while sanitization fails. In fact the only reason why the molecule fails to parse is because of kekulization.

import datamol as dm
from rdkit.Chem.MolStandardize import rdMolStandardize

vm = rdMolStandardize.RDKitValidation()
smi = "c1cc(c[n](c1)[C@H]2[C@@H]([C@@H]([C@H](O2)CO[P@@](=O)([O])O[P@](=O)(O)OC[C@@H]3[C@H]([C@H]([C@@H](O3)n4cnc5c4ncnc5N)OP(=O)(O)O)O)O)O)C(=O)N"
mol = dm.to_mol(smi, sanitize=False)
vm.validate(mol) # this is empty, so nothing technically wrong with the molecule in terms of connections.

In other words, the issue is because of the aromaticity perception of RDKit and kekulization failing.

smi = "c1cc(c[n](c1)[C@H]2[C@@H]([C@@H]([C@H](O2)CO[P@@](=O)([O])O[P@](=O)(O)OC[C@@H]3[C@H]([C@H]([C@@H](O3)n4cnc5c4ncnc5N)OP(=O)(O)O)O)O)O)C(=O)N"
mol = dm.to_mol(smi, sanitize=False)
mol.UpdatePropertyCache(strict=False)
Chem.KekulizeIfPossible(mol) #  Can't kekulize mol.  Unkekulized atoms: 0 1 2 3 5
# which is the pyridine rings, as atom 4 (uncharged Nitrogen) makes it impossible to assign double bonds.

Charge correction as done by datamol depends on the perceived current valence of the atoms vs the number of connections the atom can theoretically make. However when loading the molecules, the valence returned for atom 4 is:

implicit: 0
explicit: 3
total: 3
degree: 3
charge: 0

Therefore the algorithm skip over the atom, since everything looks fine, this is a direct consequence of each "aromatic" bond being perceived as a single connection at this point.

A naive fix for your specific case would be:

import datamol as dm
rxn = "[#7;X3;H0;r:1]>>[n+:1]"
rxn  = dm.reactions.rxn_from_smarts(rxn)
smi = "c1cc(c[n](c1)[C@H]2[C@@H]([C@@H]([C@H](O2)CO[P@@](=O)([O])O[P@](=O)(O)OC[C@@H]3[C@H]([C@H]([C@@H](O3)n4cnc5c4ncnc5N)OP(=O)(O)O)O)O)O)C(=O)N"
mol = dm.to_mol(smi, sanitize=False)
mol = dm.sanitize_first(dm.reactions.apply_reaction(rxn, (mol,), product_index=0))
dm.to_image(mol, mol_size=(400, 200), indices=True, use_svg=False)

I doubt however that this solves the main issue. If you can share your goal and how you load the original structure, I am sure there are better and more systematic approaches (including perhaps not using RDKit here) that I can point you towards.

Croydon-Brixton · 2024-09-03T18:39:51Z

Thank you for the quick and detailed answer @maclandrol !

Yes, I'm looking for a solution for this general problem:

Problem statement: Given a molecule with the following bits of information: (1) heavy atoms (but no hydrogens, these are implicit), (2) bonded structure (single/double/triple/aromatic) is there a way to infer the formal charges and valence states that make it 'valid' (= pass RDKit sanitization) whilst preserving (1) and (2)?

The reason I am asking is that when retrieving ligands from the PDB the most reliable bits of information are the bonded structures of heavy atoms and the hybridization (which translates into the single/double/... flags), but formal charge is often entirely unspecified. I would like to have a way to turn these molecules into valid ones while preserving this information. Does that make sense?

I would need to do this programatically, as it will apply to many structures. I had a look at ChEMBL's pipeline, but this was not able to do the above task either for the example I gave.

Thank you for your input!

maclandrol · 2024-09-03T19:39:08Z

Yeah, the SMARTS patterns they have does not cover your case:

Quaternary N [N;X4;v4;+0:1]>>[*+1:1] is the closest, but unfortunately the N atom does not have "4 visible" connections for RDKit.

If you are loading from PDB and PDB only, then you should consider this:

Try: https://github.com/jensengroup/xyz2mol to check if you could leverage that.

Alternatively,

Fetch PDB
Separate ligand from target (you need some flexible logic here)
Parse ligand name or ID then make a request to ligand expo to get information about the ligand and its structure. Note, people mis-annotate ligand names all the time in RCSB however.
Align the ligand PDB template to the structure provided by ligand expo
Get the ligand structure and conformation.

This would probably be something useful for the community so I can definitely help implement this.

Croydon-Brixton changed the title ~~Fixing does not appear to work for inferring valence & formal charge states from corrupted PDB files~~ Fixing does not appear to work for inferring valence & formal charge states from molecules from some PDB files Sep 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixing does not appear to work for inferring valence & formal charge states from molecules from some PDB files #231

Fixing does not appear to work for inferring valence & formal charge states from molecules from some PDB files #231

Croydon-Brixton commented Sep 3, 2024 •

edited

Loading

Croydon-Brixton commented Sep 3, 2024

maclandrol commented Sep 3, 2024

Croydon-Brixton commented Sep 3, 2024

maclandrol commented Sep 3, 2024

Fixing does not appear to work for inferring valence & formal charge states from molecules from some PDB files #231

Fixing does not appear to work for inferring valence & formal charge states from molecules from some PDB files #231

Comments

Croydon-Brixton commented Sep 3, 2024 • edited Loading

Croydon-Brixton commented Sep 3, 2024

maclandrol commented Sep 3, 2024

Croydon-Brixton commented Sep 3, 2024

maclandrol commented Sep 3, 2024

Croydon-Brixton commented Sep 3, 2024 •

edited

Loading