chempy: Interpreting malformed chemical formulas as substances

Hi, thanks for the ChemPy package! While using Substance.from_formula, I noticed that it is easy to make a mistake in the formula string without being notified. For example, adding methanol works great when the formula is typed correctly:

>>> from chempy import Substance
>>> methanol = Substance.from_formula('CH3OH')
>>> methanol.name
'CH3OH'
>>> methanol.composition
{6: 1, 1: 4, 8: 1}

If I make a minor mistake, for example forgetting to capitalize the first H for hydrogen, ChemPy gives no warning and simply stops parsing after the last valid element. The formula string is then interpreted as just C for carbon, even though the name is still the entire supplied formula string Ch3OH:

>>> c = Substance.from_formula('Ch3OH')
>>> c.name
'Ch3OH'
>>> c.composition
{6: 1}

Is there something like a strict flag (option) that emits a warning, or even raises an exception, if the entire formula string cannot be interpreted as a substance? And in that case, should the substance’s name contain only the part that was actually interpreted?
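To make the request concrete, here is a minimal sketch of the strict behavior I have in mind. It does not use ChemPy at all: parse_formula_strict and the small ATOMIC_NUMBERS table are hypothetical names made up for illustration, and the sketch only handles plain element symbols with counts (no groups, charges, or hydrates). The point is only that every character of the string must be consumed, otherwise an exception is raised.

```python
import re

# Illustrative subset of element symbols -> atomic numbers (assumption:
# a real implementation would cover the full periodic table).
ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8, "Na": 11, "Cl": 17}

# One token: an element-like symbol followed by an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula_strict(formula):
    """Return {atomic_number: count}; raise ValueError if any part of
    `formula` cannot be interpreted (hypothetical helper, not ChemPy API)."""
    if not formula:
        raise ValueError("empty formula string")
    composition = {}
    pos = 0
    while pos < len(formula):
        match = _TOKEN.match(formula, pos)
        # Reject both non-matching text and symbols that are not elements.
        if match is None or match.group(1) not in ATOMIC_NUMBERS:
            raise ValueError(
                f"cannot interpret {formula!r} from position {pos}: "
                f"{formula[pos:]!r}"
            )
        symbol, count = match.groups()
        number = ATOMIC_NUMBERS[symbol]
        composition[number] = composition.get(number, 0) + int(count or 1)
        pos = match.end()
    return composition
```

With this sketch, parse_formula_strict('CH3OH') gives the same composition as above, while parse_formula_strict('Ch3OH') raises a ValueError instead of silently returning carbon only.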

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 32 (4 by maintainers)

Most upvoted comments

The IUPAC provisional recommendation (see IR-2.2.1 at the bottom of page 3) lists braces as a grouping alternative, to be used alternately with parentheses to help avoid confusion.

I assumed that was the case. But nomenclature is such a wide and specialized area, sometimes it’s hard to keep track of it all.

By the way, how would treating square brackets as complexes (rather than simply a grouping alternative) affect the resulting substance?

Hopefully, it wouldn’t. I was looking more to add information as opposed to treating them differently. Things that depend on elemental composition should never know the difference. But, if something is in square brackets, then code could (optionally) treat it as a coordination complex (as opposed to a polyatomic ion or other group) and do appropriate stuff with it like deducing oxidation states, naming, decomposing the group into ligands, etc. I think my goal at the time was to use the parser with some additional nomenclature code to be able to interconvert formulas and names.
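As a rough illustration of "adding information without changing the composition", here is a small standalone sketch. It is not the chempy parser or my patched grammar: parse_formula, _parse_group, and the ATOMIC_NUMBERS subset are hypothetical names for this example only. All three bracket types group identically for the elemental composition; square-bracket groups are additionally recorded so that other code could optionally treat them as complexes.

```python
import re

# Illustrative subset of element symbols (assumption, not a full table).
ATOMIC_NUMBERS = {"C": 6, "N": 7, "K": 19, "Fe": 26}
CLOSERS = {"(": ")", "[": "]", "{": "}"}
_SYMBOL = re.compile(r"[A-Z][a-z]?")
_COUNT = re.compile(r"\d+")


def _read_count(formula, pos):
    """Read an optional integer multiplier starting at `pos`."""
    match = _COUNT.match(formula, pos)
    return (int(match.group()), match.end()) if match else (1, pos)


def _parse_group(formula, pos, closer, complexes):
    """Parse until `closer` (or the end of the string if closer is None)."""
    composition = {}
    while pos < len(formula):
        char = formula[pos]
        if char == closer:
            return composition, pos + 1
        if char in CLOSERS:
            start = pos
            inner, pos = _parse_group(formula, pos + 1, CLOSERS[char], complexes)
            if char == "[":
                # Extra information only; the composition is unaffected.
                complexes.append(formula[start:pos])
            count, pos = _read_count(formula, pos)
            for number, n in inner.items():
                composition[number] = composition.get(number, 0) + n * count
            continue
        match = _SYMBOL.match(formula, pos)
        if match is None or match.group() not in ATOMIC_NUMBERS:
            raise ValueError(f"cannot interpret {formula[pos:]!r} in {formula!r}")
        count, pos = _read_count(formula, match.end())
        number = ATOMIC_NUMBERS[match.group()]
        composition[number] = composition.get(number, 0) + count
    if closer is not None:
        raise ValueError(f"missing {closer!r} in {formula!r}")
    return composition, pos


def parse_formula(formula):
    """Return (composition, square_bracket_groups) for `formula`."""
    complexes = []
    composition, _ = _parse_group(formula, 0, None, complexes)
    return composition, complexes
```

For example, parse_formula('K4[Fe(CN)6]') and parse_formula('K4(Fe(CN)6)') return the same composition; only the first also reports '[Fe(CN)6]' as a square-bracket group that downstream code may choose to interpret as a coordination complex.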

I think I actually have a version of the formula parser (or at least the grammar) in chempy that handles dot notation and brackets for complexes. As I recall, it broke a fair number of the current tests, and time constraints stopped my inquiries in that direction.

While writing parsers for other situations, I have used pyparsing’s parse exceptions to signal errors, so it can be done easily. I suppose the questions are what features are desirable in the formula parser (dot notation for hydrated crystals, square brackets for complexes, the @ symbol for caged species, etc.), what behavior is desired for the strict parser flag, and whether these are separate issues.

If I can get some clarity on these issues, I will patch the relevant bits of my parser onto the current version and investigate further.