Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing does not appear to work for inferring valence & formal charge states from molecules from some PDB files #231

Open
Croydon-Brixton opened this issue Sep 3, 2024 · 4 comments

Comments

@Croydon-Brixton
Copy link

Croydon-Brixton commented Sep 3, 2024

Thank you for this nice library!

I'm have a question re fixing 'broken' Mols by inferring the correct valences and charges that I was hoping datamol could fix for me.

If I load NAP structures from examples in the pdb (e.g. 5ocm) and simply transfer over bond annotations and atoms (formal charge is not specified in this PDB, so I'm assuming 0 charge) I end up with a structure like this:

smi = "c1cc(c[n](c1)[C@H]2[C@@H]([C@@H]([C@H](O2)CO[P@@](=O)([O])O[P@](=O)(O)OC[C@@H]3[C@H]([C@H]([C@@H](O3)n4cnc5c4ncnc5N)OP(=O)(O)O)O)O)O)C(=O)N"

# The correct smiles would be:
smi_correct = "c1cc(c[n+](c1)[C@H]2[C@@H]([C@@H]([C@H](O2)CO[P@@](=O)([O-])O[P@](=O)(O)OC[C@@H]3[C@H]([C@H]([C@@H](O3)n4cnc5c4ncnc5N)OP(=O)(O)O)O)O)O)C(=O)N"

Screenshot 2024-09-02 at 19 19 10

RDkit then fails to load this due to sanitization problems

mol = Chem.MolFromSmiles(smi)  # < fails
mol = Chem.MolFromSmiles(smi, sanitize=False)  # <works and produces the structure above, which is an invalid molecule

This molecule can be 'rescued' by assigning a positive charge to nitrogen number 4, but the datamol pipeline unfortunately fails to do this:

import datamol as dm

# Standardize and sanitize
mol = Chem.MolFromSmiles(smi, sanitize=False)
mol = dm.fix_mol(mol)
mol = dm.sanitize_mol(mol)
mol = dm.standardize_mol(mol)
Chem.SanitizeMol(mol)

Is there a way to fix this structure computationally with datamol?

@Croydon-Brixton
Copy link
Author

For reference this would be the corrected structure:
Screenshot 2024-09-02 at 19 24 03

@Croydon-Brixton Croydon-Brixton changed the title Fixing does not appear to work for inferring valence & formal charge states from corrupted PDB files Fixing does not appear to work for inferring valence & formal charge states from molecules from some PDB files Sep 3, 2024
@maclandrol
Copy link
Member

Hey @Croydon-Brixton, thanks for using datamol.

Your specific question is a bit tricky. There are 2 reasons why the "fixing" fail on your molecule:

  1. RDKit Validation says the molecule is fundamentally fine, while sanitization fails. In fact the only reason why the molecule fails to parse is because of kekulization.
import datamol as dm
from rdkit.Chem.MolStandardize import rdMolStandardize

vm = rdMolStandardize.RDKitValidation()
smi = "c1cc(c[n](c1)[C@H]2[C@@H]([C@@H]([C@H](O2)CO[P@@](=O)([O])O[P@](=O)(O)OC[C@@H]3[C@H]([C@H]([C@@H](O3)n4cnc5c4ncnc5N)OP(=O)(O)O)O)O)O)C(=O)N"
mol = dm.to_mol(smi, sanitize=False)
vm.validate(mol) # this is empty, so nothing technically wrong with the molecule in terms of connections. 

In other words, the issue is because of the aromaticity perception of RDKit and kekulization failing.

smi = "c1cc(c[n](c1)[C@H]2[C@@H]([C@@H]([C@H](O2)CO[P@@](=O)([O])O[P@](=O)(O)OC[C@@H]3[C@H]([C@H]([C@@H](O3)n4cnc5c4ncnc5N)OP(=O)(O)O)O)O)O)C(=O)N"
mol = dm.to_mol(smi, sanitize=False)
mol.UpdatePropertyCache(strict=False)
Chem.KekulizeIfPossible(mol) #  Can't kekulize mol.  Unkekulized atoms: 0 1 2 3 5
# which is the pyridine rings, as atom 4 (uncharged Nitrogen) makes it impossible to assign double bonds. 
  1. Charge correction as done by datamol depends on the perceived current valence of the atoms vs the number of connections the atom can theoretically make. However when loading the molecules, the valence returned for atom 4 is:
  • implicit: 0
  • explicit: 3
  • total: 3
  • degree: 3
  • charge: 0

Therefore the algorithm skip over the atom, since everything looks fine, this is a direct consequence of each "aromatic" bond being perceived as a single connection at this point.

A naive fix for your specific case would be:

import datamol as dm
rxn = "[#7;X3;H0;r:1]>>[n+:1]"
rxn  = dm.reactions.rxn_from_smarts(rxn)
smi = "c1cc(c[n](c1)[C@H]2[C@@H]([C@@H]([C@H](O2)CO[P@@](=O)([O])O[P@](=O)(O)OC[C@@H]3[C@H]([C@H]([C@@H](O3)n4cnc5c4ncnc5N)OP(=O)(O)O)O)O)O)C(=O)N"
mol = dm.to_mol(smi, sanitize=False)
mol = dm.sanitize_first(dm.reactions.apply_reaction(rxn, (mol,), product_index=0))
dm.to_image(mol, mol_size=(400, 200), indices=True, use_svg=False)

image

I doubt however that this solves the main issue. If you can share your goal and how you load the original structure, I am sure there are better and more systematic approaches (including perhaps not using RDKit here) that I can point you towards.

@Croydon-Brixton
Copy link
Author

Thank you for the quick and detailed answer @maclandrol !

Yes, I'm looking for a solution for this general problem:

Problem statement: Given a molecule with the following bits of information: (1) heavy atoms (but no hydrogens, these are implicit), (2) bonded structure (single/double/triple/aromatic) is there a way to infer the formal charges and valence states that make it 'valid' (= pass RDKit sanitization) whilst preserving (1) and (2)?

The reason I am asking is that when retrieving ligands from the PDB the most reliable bits of information are the bonded structures of heavy atoms and the hybridization (which translates into the single/double/... flags), but formal charge is often entirely unspecified. I would like to have a way to turn these molecules into valid ones while preserving this information. Does that make sense?

I would need to do this programatically, as it will apply to many structures. I had a look at ChEMBL's pipeline, but this was not able to do the above task either for the example I gave.

Thank you for your input!

@maclandrol
Copy link
Member

Yeah, the SMARTS patterns they have does not cover your case:

Quaternary N [N;X4;v4;+0:1]>>[*+1:1] is the closest, but unfortunately the N atom does not have "4 visible" connections for RDKit.

If you are loading from PDB and PDB only, then you should consider this:

  1. Try: https://github.com/jensengroup/xyz2mol to check if you could leverage that.

Alternatively,

  1. Fetch PDB
  2. Separate ligand from target (you need some flexible logic here)
  3. Parse ligand name or ID then make a request to ligand expo to get information about the ligand and its structure. Note, people mis-annotate ligand names all the time in RCSB however.
  4. Align the ligand PDB template to the structure provided by ligand expo
  5. Get the ligand structure and conformation.

This would probably be something useful for the community so I can definitely help implement this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants