chempy: Interpreting malformed chemical formulas as substances
Hi, thanks for the ChemPy package! While using Substance.from_formula, I noticed that it is easy to make a mistake in the formula string and not be notified about it. For example, adding methanol works great when the formula is written correctly:
>>> from chempy import Substance
>>> methanol = Substance.from_formula('CH3OH')
>>> methanol.name
'CH3OH'
>>> methanol.composition
{6: 1, 1: 4, 8: 1}
If I make a minor mistake, for example forgetting to capitalize the first H for hydrogen, ChemPy gives no warning and simply stops at the last valid element. So the formula string is interpreted as simply C for carbon, even though the name is the entire supplied formula string Ch3OH:
>>> c = Substance.from_formula('Ch3OH')
>>> c.name
'Ch3OH'
>>> c.composition
{6: 1}
Is there something like a strict flag (option) to emit a warning or even raise an exception if the entire formula string cannot be interpreted as a substance? And if the entire formula string cannot be interpreted, should the substance’s name contain only the part that was interpreted as a substance?
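To make the request concrete, a strict check could look roughly like the sketch below. It is a standalone illustration and not part of ChemPy: the names strict_composition and ATOMIC_NUMBERS are hypothetical, the element table is a small subset, and parentheses, charges, and hydrates are not handled. The point is only that the parser refuses to stop silently when text is left over.

```python
import re

# Illustrative subset of the periodic table (symbol -> atomic number).
# A real check would need the full table; this one is an assumption.
ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8, "Na": 11, "Cl": 17}

# Try longer symbols first so that "Cl" is not read as "C" followed by "l".
_SYMBOLS = sorted(ATOMIC_NUMBERS, key=len, reverse=True)
_TOKEN = re.compile("(" + "|".join(_SYMBOLS) + r")(\d*)")


def strict_composition(formula):
    """Return {atomic_number: count}; raise ValueError on unparsed text."""
    if not formula:
        raise ValueError("empty formula string")
    composition = {}
    pos = 0
    while pos < len(formula):
        match = _TOKEN.match(formula, pos)
        if match is None:
            # Nothing valid starts here, so report instead of stopping quietly.
            raise ValueError(
                f"cannot interpret {formula[pos:]!r} "
                f"(position {pos}) in formula {formula!r}"
            )
        number = ATOMIC_NUMBERS[match.group(1)]
        count = int(match.group(2) or 1)
        composition[number] = composition.get(number, 0) + count
        pos = match.end()
    return composition
```

With this sketch, strict_composition('CH3OH') gives the same composition as above, while strict_composition('Ch3OH') raises a ValueError pointing at 'h3OH'. A check like this could be run on the string before it is handed to Substance.from_formula.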
About this issue
- State: closed
- Created 2 years ago
- Comments: 32 (4 by maintainers)
I assumed that was the case. But nomenclature is such a wide and specialized area, sometimes it’s hard to keep track of it all.
Hopefully, it wouldn’t. I was looking more to add information as opposed to treating them differently. Things that depend on elemental composition should never know the difference. But, if something is in square brackets, then code could (optionally) treat it as a coordination complex (as opposed to a polyatomic ion or other group) and do appropriate stuff with it like deducing oxidation states, naming, decomposing the group into ligands, etc. I think my goal at the time was to use the parser with some additional nomenclature code to be able to interconvert formulas and names.
I think I actually have a version of the formula parser (or at least the grammar) in chempy that handles dot notation and brackets for complexes. As I recall it actually broke a fair amount of the current tests and time constraints stopped my inquiries in that direction.
While writing parsers for other situations, I have used pyparsing’s parse exceptions to signal errors, so it can be done (easily). I suppose the questions are what features are desirable in the formula parser (dot notation for hydrated crystals, square brackets for complexes, @ caged symbol, etc.), what behavior is desired for the strict parser flag, and whether these are separate issues. If I can get some clarity on these issues, I will patch the relevant bits of my parser onto the current version and investigate further.
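As a rough illustration of signalling errors through pyparsing's parse exceptions, the sketch below uses a toy grammar, not chempy's actual one. The function parse_formula, its strict parameter, and the short element list are all assumptions made for the example. Passing parseAll=True makes pyparsing raise ParseException when the grammar cannot consume the whole string, and the function turns that into a ValueError.

```python
from pyparsing import Group, OneOrMore, Optional, ParseException, Word, nums, oneOf

# Toy grammar: a few element symbols, each with an optional integer count.
# The element list is an illustrative subset, not chempy's grammar.
element = oneOf("H C N O Na Cl")
count = Optional(Word(nums), default="1")
formula_grammar = OneOrMore(Group(element + count))


def parse_formula(formula, strict=True):
    """Return (symbol, count) pairs; with strict=True, reject leftover text."""
    try:
        # parseAll=True requires the grammar to match the entire string.
        tokens = formula_grammar.parseString(formula, parseAll=strict)
    except ParseException as exc:
        raise ValueError(f"malformed formula {formula!r}: {exc}") from exc
    return [(symbol, int(n)) for symbol, n in tokens]
```

Here parse_formula('Ch3OH') raises, whereas parse_formula('Ch3OH', strict=False) reproduces the lenient behavior from the original report and returns only the carbon. Whether a non-strict parse should warn, and what the name should be in that case, is left open in this sketch.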