tskit: metadata schema equality and hence table equality is not well-defined

Maybe “not well-defined” isn’t the right phrase, but it is currently pretty hard to get two tables or table collections produced by different sources to compare equal, because their metadata schemas must match at the byte level, and so equality relies on the whitespace and key ordering of the JSON objects matching. Furthermore, in Python it seems impossible to diagnose the problem, since (a) equality of MetadataSchema objects is identity-only, and (b) their dict or string representations may match even if the underlying bytes do not.

So, for instance:

>>> import msprime, tskit
>>> m1 = tskit.MetadataSchema({'codec': 'json', 'type':'object'})
>>> m2 = tskit.MetadataSchema({'type':'object', 'codec': 'json'})
>>> t1 = msprime.simulate(4).tables
>>> t2 = t1.copy()
>>> t1 == t2
True
>>> t1.metadata_schema = m1
>>> t2.metadata_schema = m2
>>> t1 == t2
False
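To see where the mismatch comes from without going through tskit at all, here is a minimal sketch using only the standard library. The two schema strings are made-up stand-ins for what two different sources might write; they are not taken from SLiM or pyslim output.

```python
import json

# Two hypothetical serializations of the same schema: they differ only in
# key order and whitespace.
schema_a = '{"codec": "json", "type": "object"}'
schema_b = '{"type":"object","codec":"json"}'

# A byte-level comparison treats them as different...
bytes_equal = schema_a.encode("utf-8") == schema_b.encode("utf-8")

# ...while comparing the parsed objects ignores ordering and whitespace.
parsed_equal = json.loads(schema_a) == json.loads(schema_b)

print(bytes_equal, parsed_equal)
```

Comparing the parsed objects is the notion of equality one would probably want here, but it requires a JSON parser wherever the comparison is done.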

See below for other examples.

This is being pretty bothersome for testing things in pyslim: the schema as set by SLiM differs from that set by pyslim in ordering and whitespace, even though the source code for the two is as identical as possible. Testing for equality of tables and table collections is important in other places, too.

To do this properly we’d have to parse the JSON in C, right? Which we don’t really want to do - any other ideas? Even if we didn’t check metadata schemas when testing equality, we’d run into similar problems with top-level metadata if we are proposing that applications edit the top-level metadata schema to add new keys: two operations could easily commute except for the order in which they add to the metadata schema.
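One possible way around parsing JSON in C, offered only as a sketch: have whoever writes the schema serialize it in a canonical form (sorted keys, fixed separators), so that a plain byte comparison is enough afterwards. The helpers `canonical_schema` and `add_key` below are hypothetical names for illustration, not part of tskit, and this assumes every producer of schemas can be made to use the same canonical form.

```python
import copy
import json


def canonical_schema(schema: dict) -> str:
    """Hypothetical helper: serialize with sorted keys and no extra whitespace."""
    return json.dumps(schema, sort_keys=True, separators=(",", ":"))


def add_key(schema: dict, name: str, spec: dict) -> dict:
    """Hypothetical helper: return a copy of the schema with one more property."""
    new = copy.deepcopy(schema)
    new["properties"][name] = spec
    return new


base = {"codec": "json", "type": "object", "properties": {}}

# Two operations that each add a key, applied in either order.
ab = add_key(add_key(base, "a", {"type": "number"}), "b", {"type": "string"})
ba = add_key(add_key(base, "b", {"type": "string"}), "a", {"type": "number"})

# Naive serialization preserves insertion order, so the strings differ.
naive_differs = json.dumps(ab) != json.dumps(ba)

# Canonical serialization gives identical strings (and hence identical bytes).
canonical_matches = canonical_schema(ab) == canonical_schema(ba)

print(naive_differs, canonical_matches)
```

With this approach the C-level comparison can stay a byte comparison; the cost is that a schema written by a tool that does not canonicalize would still compare unequal.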

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 21 (21 by maintainers)

Most upvoted comments

But maybe I should just not worry about that for the moment, since there’s nothing else that actually uses the top-level metadata?

Let’s not worry about this for now. We’re going to need to coordinate on some shared vocabulary here if we want the metadata to be interoperable, so let’s just use it as we like for the moment and figure out what’s useful later.