tskit: Edge case issue with simplify and missing data

I’ve just hit an edge case that prevents me from round-tripping some missing data examples. Suppose that at an extreme edge of the genome only a single sample has non-missing data. This can be represented by a tree in that region with only a single branch, connecting that sample to the root. However, if we run simplify() on such a tree sequence, the edge is removed, because its parent is only a unary node there. That leaves the sample as an “isolated node”, so the missing data code in https://github.com/tskit-dev/tskit/pull/272/ flags it as a case where the genotype should be set to -1, even though here we do have the information needed to encode the genotype properly.

I’m wondering whether this is an issue with the missing data code or with the simplify() code. For example, it might be considered reasonable for simplify() not to drop unary nodes above a sample if they connect that sample to the root. But I’m not sure how the root would be identified in this case.

Ping @jeromekelleher and @petrelharp as they are the simplifying and missing data gurus 😃

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

Hm: I think that simplify is definitely doing the right thing, as originally defined. That edge doesn’t reflect a genealogical relationship between the samples, which is how we’ve defined things.

I agree, although it was useful to think through.