openff-toolkit: Spec for undefined stereochemistry

Purpose

Because it’s a complicated topic, and to maintain the ability to restart this discussion in the future, I think it’s important to record our design decisions with regard to stereochemistry in this repository.

Background

Stereochemical differences are important in the SMIRNOFF spec, as they can lead to different partial charges or partial bond orders on a molecule, and (if the project goes in that direction), different assignment of library parameters using stereochemistry-dependent SMARTS.

However, the definition of meaningful stereochemistry can change depending on experimental conditions and timescale. For example a tertiary amine will invert 10^3 to 10^5 times per second in solution. Therefore, on a wet-lab-experiment timescale, the amine does not have meaningful stereochemistry. However, if one is running a QM calculation on the tertiary amine, there may be meaningful differences in the results depending on its stereochemistry. Thus, whether or not stereochemistry is required in a tertiary amine depends on your use case.

Our definition of stereochemistry will ideally cover the fact that MD simulations can run for micro- and milli-seconds, therefore molecular features that are likely to change stereochemistry on those timescales should not be assigned stereochemistry-specific properties. It is unclear currently whether this decision should be made at the point of SMIRNOFF molecule definition (ie, we enforce that short-lived stereochemical features do not distinguish one OFF molecule from another, thereby preventing our engine from assigning them different simulation parameters), or if the SMARTS that define our parameter libraries should be policed moving forward, to ensure that no two parameters are differentiated only by low-barrier stereochemistry (eg, we will never create a SMARTS that relies on the stereochemistry of a tertiary amine).

Current implementation plans

Read below for details, but I think a good plan is to let the toolkits use their own (slightly different!) standards in choosing how to accept SMILES. While they mostly agree with what “fully specified stereochemistry” is, they differ on some borderline cases.

While this isn’t ideal, ensuring the same stereochemistry standard across toolkits will require a lot of engineering, and will restrict us to the union of the most restrictive stereochemistry standards of all of our toolkits (eg. OE omega complains if bond stereo isn’t defined for CC=NH, which is a bit much).

3D molecule input

We haven’t run into too many problems with sdf and mol2 input. OE and RDK currently use their default stereochemistry detection in these cases and seem to rarely have problems. I haven’t dug very deeply into these, so I will update this post as I look further into it.

Potentially relevant to 3D stereochemistry detection: #141, #137

SMILES input

“Common sense” atom and double bond stereochemistry is required, as is implemented by default in OE and RDK. There are, however, a few cases where disagreement occurs, both between OE and RDK, and between a toolkit and our intuition.

This table records the cases of obvious and border-case stereochemistry we’ve found so far, whether we think molecules should labeled as having undefined stereochemistry (on the MD timescale), and whether the toolkits define the SMILES as having undefined stereochemistry. If applicable, experimental barriers/interconversion rates from literature should be added.


SMILES Description Humans say it has undefined stereochem? OE says it has undefined stereochem? RDK says it has undefined stereochem? barrier conversion rate
C(Cl)(F)(Br)C Undefined tetrahedral center Yes Yes Yes
C(F)=C(Cl) Undefined C=C double bond Yes Yes Yes
C\C(F)=C(/F)CC(C)(Cl)Br Undefined stereocenter, defined stereobond Yes Yes Yes
CC(F)=C(F)C[C@@](C)(Cl)Br Defined stereocenter, undefined stereobond Yes Yes Yes
C\C(F)=C(/F)C[C@@](C)(Cl)Br defined stereocenter, defined stereobond No No No
CN=C(F)C Undefined imine Yes Yes Yes > 40 kcal/mol
N=C(F)C Undefined primary imine No Yes No
N(Cl)(F)C Undefined pyrimidal nitrogen No Optional [1] No 10^3 - 10^5 sec^-1
CP(O)(=O)N tetrahedral phosphate with one protonated oxygen Yes Yes Yes
CNC(C)=O keto-form of amide bond No No No
CN=C(C)O enol-form of amide bond [2] Yes Yes
C(=[O+][H])C oximium ? Yes No
N(C)(CC)(CCC)=O ? Yes? Yes No
C[C@H]1CN1C, may have tested using different method trivalent N in 3-membered ring Yes, as of 2019.3 Yes Yes
C[C@H]1CN1, may have tested using different method trivalent N in 3-membered ring (1 hydrogen) ? No? Yes
[H]c1c(c(c(c(c1[H])[H])P2C(=C3C(=C2c4c(c(c(c(n4)[H])[H])[H])[H])C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])c5c(c(c(c(n5)[H])[H])[H])[H])[H])[H] trivalent phosphorous ? No Yes considered stereogenic starting in RDK 2020.3 release https://github.com/rdkit/rdkit/issues/2949
CCS(=O)C thione ? ? ? ? ?
COP(=O)([O-])OCC deprotonated NA phosphate backbone N ? ? ? deprotonated at physiological pH (pKa=2ish)

[1] OE says “Yes” when using method A, but when “No” when method B is given enum_nitrogens=False. [2] I’m not certain what the rotation timescale for amide bonds is, but this might just be in molecular dynamics range. Letting the enol form of the amide get stereochemistry might give users the choice of whether they want it defined.

Code to reproduce the above

from rdkit import Chem 
from rdkit.Chem import EnumerateStereoisomers

# Define the SMILESes, 
smiles_ambiguous = [
                    ('C(Cl)(F)(Br)C', True), # Undefined tetrahedral center
                    ('C(F)=C(Cl)', True), # Undefined RC=CR 
                    ('C(F)(Br)=C(Cl)(I)', True), # Undefined RC=CR 
                    ("C\C(F)=C(/F)CC(C)(Cl)Br", True), # Undefined stereocenter, defined stereobond
                    ("CC(F)=C(F)C[C@@](C)(Cl)Br", True), # Defined stereocenter, undefined stereobond
                    ("C\C(F)=C(/F)C[C@@](C)(Cl)Br", False), # defined stereocenter, defined stereobond
                    ('CN=C(F)C', True), # RC=NR
                    ('N=C(F)C', False), # RC=NH
                    ('N(Cl)(F)C', False), #Pyrimidal nitrogen center
                    ('CP(O)(=O)N', True), #phosphate
                    ('CNC(C)=O', False), # keto form of amide
                    ('CN=C(C)O', False), #enol form of peptide
                    ('C(=[O+][H])C', False), # charged oxygen
                    ('N(C)(CC)(CCC)=O', True), # ?
                  ]


def rdkit_unspec_stereo(smiles):
    from rdkit import Chem
    from rdkit.Chem import EnumerateStereoisomers

    rdmol = Chem.MolFromSmiles(smiles)

    # Adding H's can hide undefined bond stereochemistry, so we have to test for undefined stereo here
    unspec_stereo = False
    rdmol_copy = Chem.Mol(rdmol)
    enumsi_opt = EnumerateStereoisomers.StereoEnumerationOptions(maxIsomers=2, onlyUnassigned=True)
    stereoisomers = [isomer for isomer in Chem.EnumerateStereoisomers.EnumerateStereoisomers(rdmol_copy, enumsi_opt)]
    if len(stereoisomers) != 1:
        unspec_stereo = True

    if unspec_stereo:
        return True
    return False

def oe_unspec_stereo_A(smiles):
    molecule = oechem.OEMol()
    oechem.OESmilesToMol(molecule, smiles)
    atoms_to_highlight = list()
    bonds_to_highlight = list()
    for atom in molecule.GetAtoms():
        if atom.IsChiral() and not atom.HasStereoSpecified(oechem.OEAtomStereo_Tetrahedral):
            # Check if handness is specified
            v = []
            for nbr in atom.GetAtoms():
                v.append(nbr)
            stereo = atom.GetStereo(v, oechem.OEAtomStereo_Tetrahedral)
            if stereo == oechem.OEAtomStereo_Undefined:
                atoms_to_highlight.append(atom)
    for bond in molecule.GetBonds():
        if bond.IsChiral() and not bond.HasStereoSpecified(oechem.OEBondStereo_CisTrans):
            v = []
            for neigh in bond.GetBgn().GetAtoms():
                if neigh != bond.GetEnd():
                    v.append(neigh)
                    break
            for neigh in bond.GetEnd().GetAtoms():
                if neigh != bond.GetBgn():
                    v.append(neigh)
                    break
            stereo = bond.GetStereo(v, oechem.OEBondStereo_CisTrans)

            if stereo == oechem.OEBondStereo_Undefined:
                bonds_to_highlight.append(bond)
    if len(atoms_to_highlight+bonds_to_highlight) == 0:
        return False
    return True
    

def oe_unspec_stereo_B(smiles):
    from openeye import oechem
    from openeye import oeomega
    maxcenters = 5
    force_flip = False
    enum_nitrogens = False
    mol = oechem.OEMol()
    if not oechem.OESmilesToMol(mol, smiles):
        print(s, "Not Parsed")
    n_isomers = 0
    for isomer in oeomega.OEFlipper(mol, maxcenters, force_flip, enum_nitrogens):
        n_isomers += 1
    if n_isomers == 1:
        return False
    elif n_isomers > 1:
        return True


method_list = [rdkit_unspec_stereo, oe_unspec_stereo_A, oe_unspec_stereo_B]
for smiles, should_be_ambiguous in smiles_ambiguous:
    method_results = [method(smiles) for method in method_list]
    print(smiles, should_be_ambiguous, method_results[0], method_results[1], method_results[2])

About this issue

  • Original URL
  • State: open
  • Created 6 years ago
  • Reactions: 3
  • Comments: 16 (5 by maintainers)

Most upvoted comments

Sure:

In [3]: m = Chem.MolFromSmiles('C[C@H]([F:1])[F:2]')                                                                                          

In [4]: m.GetAtomWithIdx(1).GetPropsAsDict()                                                                                                  
Out[4]: 
{'__computedProps': <rdkit.rdBase._vectNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE at 0x7fdb2183e030>,
 '_CIPRank': 1,
 '_ChiralityPossible': 1,
 '_CIPCode': 'R'}

In [6]: for at in m.GetAtoms(): at.SetAtomMapNum(0)                                                                                           

In [8]: Chem.AssignStereochemistry(m,force=True,cleanIt=True,flagPossibleStereoCenters=True)                                                  

In [9]: m.GetAtomWithIdx(1).GetPropsAsDict()                                                                                                  
Out[9]: 
{'__computedProps': <rdkit.rdBase._vectNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE at 0x7fdb2183e630>,
 '_CIPRank': 1}

@jchodera. Agreed – In the interest of time, I need to table this issue for now, and the default toolkit behavior isn’t that bad. I don’t think this requires time at the January in-person meeting, though maybe we can plan an online meeting for stereochemistry enthusiasts later in January.

@ChayaSt Yes – One thing I’m coming to terms with on this issue is that stereochemistry is complicated. We want a certain behavior when processing a molecule for stereochemistry (for example, we want RC=NH to not require stereochemistry), but as it stands, each toolkit has decided what their built-in stereochemistry model does with that moiety (As you mentioned, it doesn’t help that RDKit’s stereo expectations change depending on whether H’s are added).

For that reason, and kind of to build toward John’s “compatibility” idea, I think we should treat this as a test-driven problem. That is, I’m planning on having a data set of SMILESes for which we (as CMILES/SMIRNOFF developers) say what behavior we want for each SMILES string, and then we create (potentially complicated) functions written using the different toolkits, until we get the behavior we want for all the SMILES strings in our data set. (OpenEye just wrote back suggesting we do this)