openff-toolkit: Spec for undefined stereochemistry
Purpose
Because it’s a complicated topic, and to maintain the ability to restart this discussion in the future, I think it’s important to record our design decisions with regard to stereochemistry in this repository.
Background
Stereochemical differences are important in the SMIRNOFF spec, as they can lead to different partial charges or partial bond orders on a molecule, and (if the project goes in that direction), different assignment of library parameters using stereochemistry-dependent SMARTS.
However, the definition of meaningful stereochemistry can change depending on experimental conditions and timescale. For example a tertiary amine will invert 10^3 to 10^5 times per second in solution. Therefore, on a wet-lab-experiment timescale, the amine does not have meaningful stereochemistry. However, if one is running a QM calculation on the tertiary amine, there may be meaningful differences in the results depending on its stereochemistry. Thus, whether or not stereochemistry is required in a tertiary amine depends on your use case.
Our definition of stereochemistry will ideally cover the fact that MD simulations can run for micro- and milli-seconds, therefore molecular features that are likely to change stereochemistry on those timescales should not be assigned stereochemistry-specific properties. It is unclear currently whether this decision should be made at the point of SMIRNOFF molecule definition (ie, we enforce that short-lived stereochemical features do not distinguish one OFF molecule from another, thereby preventing our engine from assigning them different simulation parameters), or if the SMARTS that define our parameter libraries should be policed moving forward, to ensure that no two parameters are differentiated only by low-barrier stereochemistry (eg, we will never create a SMARTS that relies on the stereochemistry of a tertiary amine).
Current implementation plans
Read below for details, but I think a good plan is to let the toolkits use their own (slightly different!) standards in choosing how to accept SMILES. While they mostly agree with what “fully specified stereochemistry” is, they differ on some borderline cases.
While this isn’t ideal, ensuring the same stereochemistry standard across toolkits will require a lot of engineering, and will restrict us to the union of the most restrictive stereochemistry standards of all of our toolkits (eg. OE omega complains if bond stereo isn’t defined for CC=NH, which is a bit much).
3D molecule input
We haven’t run into too many problems with sdf and mol2 input. OE and RDK currently use their default stereochemistry detection in these cases and seem to rarely have problems. I haven’t dug very deeply into these, so I will update this post as I look further into it.
Potentially relevant to 3D stereochemistry detection: #141, #137
SMILES input
“Common sense” atom and double bond stereochemistry is required, as is implemented by default in OE and RDK. There are, however, a few cases where disagreement occurs, both between OE and RDK, and between a toolkit and our intuition.
This table records the cases of obvious and border-case stereochemistry we’ve found so far, whether we think molecules should labeled as having undefined stereochemistry (on the MD timescale), and whether the toolkits define the SMILES as having undefined stereochemistry. If applicable, experimental barriers/interconversion rates from literature should be added.
| SMILES | Description | Humans say it has undefined stereochem? | OE says it has undefined stereochem? | RDK says it has undefined stereochem? | barrier | conversion rate |
|---|---|---|---|---|---|---|
C(Cl)(F)(Br)C |
Undefined tetrahedral center | Yes | Yes | Yes | ||
C(F)=C(Cl) |
Undefined C=C double bond | Yes | Yes | Yes | ||
C\C(F)=C(/F)CC(C)(Cl)Br |
Undefined stereocenter, defined stereobond | Yes | Yes | Yes | ||
CC(F)=C(F)C[C@@](C)(Cl)Br |
Defined stereocenter, undefined stereobond | Yes | Yes | Yes | ||
C\C(F)=C(/F)C[C@@](C)(Cl)Br |
defined stereocenter, defined stereobond | No | No | No | ||
CN=C(F)C |
Undefined imine | Yes | Yes | Yes | > 40 kcal/mol | |
N=C(F)C |
Undefined primary imine | No | Yes | No | ||
N(Cl)(F)C |
Undefined pyrimidal nitrogen | No | Optional [1] | No | 10^3 - 10^5 sec^-1 | |
CP(O)(=O)N |
tetrahedral phosphate with one protonated oxygen | Yes | Yes | Yes | ||
CNC(C)=O |
keto-form of amide bond | No | No | No | ||
CN=C(C)O |
enol-form of amide bond | [2] | Yes | Yes | ||
C(=[O+][H])C |
oximium | ? | Yes | No | ||
N(C)(CC)(CCC)=O |
? | Yes? | Yes | No | ||
C[C@H]1CN1C, may have tested using different method |
trivalent N in 3-membered ring | Yes, as of 2019.3 | Yes | Yes | ||
C[C@H]1CN1, may have tested using different method |
trivalent N in 3-membered ring (1 hydrogen) | ? | No? | Yes | ||
[H]c1c(c(c(c(c1[H])[H])P2C(=C3C(=C2c4c(c(c(c(n4)[H])[H])[H])[H])C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])c5c(c(c(c(n5)[H])[H])[H])[H])[H])[H] |
trivalent phosphorous | ? | No | Yes | considered stereogenic starting in RDK 2020.3 release https://github.com/rdkit/rdkit/issues/2949 | |
CCS(=O)C |
thione | ? | ? | ? | ? | ? |
COP(=O)([O-])OCC |
deprotonated NA phosphate backbone | N | ? | ? | ? | deprotonated at physiological pH (pKa=2ish) |
[1] OE says “Yes” when using method A, but when “No” when method B is given enum_nitrogens=False.
[2] I’m not certain what the rotation timescale for amide bonds is, but this might just be in molecular dynamics range. Letting the enol form of the amide get stereochemistry might give users the choice of whether they want it defined.
Code to reproduce the above
from rdkit import Chem
from rdkit.Chem import EnumerateStereoisomers
# Define the SMILESes,
smiles_ambiguous = [
('C(Cl)(F)(Br)C', True), # Undefined tetrahedral center
('C(F)=C(Cl)', True), # Undefined RC=CR
('C(F)(Br)=C(Cl)(I)', True), # Undefined RC=CR
("C\C(F)=C(/F)CC(C)(Cl)Br", True), # Undefined stereocenter, defined stereobond
("CC(F)=C(F)C[C@@](C)(Cl)Br", True), # Defined stereocenter, undefined stereobond
("C\C(F)=C(/F)C[C@@](C)(Cl)Br", False), # defined stereocenter, defined stereobond
('CN=C(F)C', True), # RC=NR
('N=C(F)C', False), # RC=NH
('N(Cl)(F)C', False), #Pyrimidal nitrogen center
('CP(O)(=O)N', True), #phosphate
('CNC(C)=O', False), # keto form of amide
('CN=C(C)O', False), #enol form of peptide
('C(=[O+][H])C', False), # charged oxygen
('N(C)(CC)(CCC)=O', True), # ?
]
def rdkit_unspec_stereo(smiles):
from rdkit import Chem
from rdkit.Chem import EnumerateStereoisomers
rdmol = Chem.MolFromSmiles(smiles)
# Adding H's can hide undefined bond stereochemistry, so we have to test for undefined stereo here
unspec_stereo = False
rdmol_copy = Chem.Mol(rdmol)
enumsi_opt = EnumerateStereoisomers.StereoEnumerationOptions(maxIsomers=2, onlyUnassigned=True)
stereoisomers = [isomer for isomer in Chem.EnumerateStereoisomers.EnumerateStereoisomers(rdmol_copy, enumsi_opt)]
if len(stereoisomers) != 1:
unspec_stereo = True
if unspec_stereo:
return True
return False
def oe_unspec_stereo_A(smiles):
molecule = oechem.OEMol()
oechem.OESmilesToMol(molecule, smiles)
atoms_to_highlight = list()
bonds_to_highlight = list()
for atom in molecule.GetAtoms():
if atom.IsChiral() and not atom.HasStereoSpecified(oechem.OEAtomStereo_Tetrahedral):
# Check if handness is specified
v = []
for nbr in atom.GetAtoms():
v.append(nbr)
stereo = atom.GetStereo(v, oechem.OEAtomStereo_Tetrahedral)
if stereo == oechem.OEAtomStereo_Undefined:
atoms_to_highlight.append(atom)
for bond in molecule.GetBonds():
if bond.IsChiral() and not bond.HasStereoSpecified(oechem.OEBondStereo_CisTrans):
v = []
for neigh in bond.GetBgn().GetAtoms():
if neigh != bond.GetEnd():
v.append(neigh)
break
for neigh in bond.GetEnd().GetAtoms():
if neigh != bond.GetBgn():
v.append(neigh)
break
stereo = bond.GetStereo(v, oechem.OEBondStereo_CisTrans)
if stereo == oechem.OEBondStereo_Undefined:
bonds_to_highlight.append(bond)
if len(atoms_to_highlight+bonds_to_highlight) == 0:
return False
return True
def oe_unspec_stereo_B(smiles):
from openeye import oechem
from openeye import oeomega
maxcenters = 5
force_flip = False
enum_nitrogens = False
mol = oechem.OEMol()
if not oechem.OESmilesToMol(mol, smiles):
print(s, "Not Parsed")
n_isomers = 0
for isomer in oeomega.OEFlipper(mol, maxcenters, force_flip, enum_nitrogens):
n_isomers += 1
if n_isomers == 1:
return False
elif n_isomers > 1:
return True
method_list = [rdkit_unspec_stereo, oe_unspec_stereo_A, oe_unspec_stereo_B]
for smiles, should_be_ambiguous in smiles_ambiguous:
method_results = [method(smiles) for method in method_list]
print(smiles, should_be_ambiguous, method_results[0], method_results[1], method_results[2])
About this issue
- Original URL
- State: open
- Created 6 years ago
- Reactions: 3
- Comments: 16 (5 by maintainers)
Sure:
@jchodera. Agreed – In the interest of time, I need to table this issue for now, and the default toolkit behavior isn’t that bad. I don’t think this requires time at the January in-person meeting, though maybe we can plan an online meeting for stereochemistry enthusiasts later in January.
@ChayaSt Yes – One thing I’m coming to terms with on this issue is that stereochemistry is complicated. We want a certain behavior when processing a molecule for stereochemistry (for example, we want
RC=NHto not require stereochemistry), but as it stands, each toolkit has decided what their built-in stereochemistry model does with that moiety (As you mentioned, it doesn’t help that RDKit’s stereo expectations change depending on whether H’s are added).For that reason, and kind of to build toward John’s “compatibility” idea, I think we should treat this as a test-driven problem. That is, I’m planning on having a data set of SMILESes for which we (as CMILES/SMIRNOFF developers) say what behavior we want for each SMILES string, and then we create (potentially complicated) functions written using the different toolkits, until we get the behavior we want for all the SMILES strings in our data set. (OpenEye just wrote back suggesting we do this)