tskit: Some questions about missing data

Sorry to bring this up again, but #714 made @bhaller ask some good questions. Two observations:

  1. It is surprising that simplify() can make a formerly non-missing genotype become missing.
  2. An isolated node is considered to have a missing genotype even if it has a mutation above it, which is also surprising.

The first thing is defensible, because we treat the state of every root as the “ancestral state” even though there’s got to be some element of unknown-ness. But it is certainly surprising, and makes me wonder if we can modify this behavior. The easiest way I could think of to modify it would be to *have some way to mark an isolated node as “not actually missing”. *

The second thing does not feel right.

Here’s a demonstration of (2), btw:

tables = tskit.TableCollection(sequence_length=1.0)
tables.nodes.add_row(time=0, flags=1)
tables.nodes.add_row(time=0, flags=1)
tables.nodes.add_row(time=0, flags=1)
tables.nodes.add_row(time=1, flags=0)
tables.edges.add_row(parent=3, child=1, left=0, right=1)
tables.edges.add_row(parent=3, child=2, left=0, right=1)
tables.sites.add_row(position=0.5, ancestral_state='x')
tables.mutations.add_row(site=0, node=0, derived_state='y')
ts = tables.tree_sequence()
ts.genotype_matrix()
# array([[-1,  0,  0]], dtype=int8)

I don’t remember discussion of this point when working out missing-ness, but maybe I’m forgetting it?

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 35 (29 by maintainers)

Most upvoted comments

I’m with @jeromekelleher, and ISOLATED_NOT_MISSING on the C side.

I’m coming around to isolated_as_missing=True - this seems to convey the most information, and since people will need to use the option very rarely (missing data will be handled correctly, by default), it’s OK that it’s a little on the long side.

Hi all, 0.3.0 is very close now thanks to #782. The name changes discussed above are the only remaining item.

isolated_are_missing seems to have most support if I’ve read the thread correctly. On the C side there is less consensus, but TSK_ISOLATED_NOT_MISSING seems to mirror the python name.

Unless there is any issue shall we go with these?

Gee, missing_data doesn’t sound too cryptic to me?

For the stats, remove_missing would be more descriptive, but that doesn’t work for e.g. .variants(). So, I’d vote for missing_data … and maybe TSK_NO_MISSING_DATA? What do you all think?

Thinking this through… for the name of the flag/option, the two options are essentially either (a) “do we intend isolated samples to be missing data” or (b) “let isolated samples take the ancestral state”. The first is more general, so we could re-use the name of the option in more other methods. But the second is more precise, and says what the option is actually making the method do.

Thinking ahead to statistics: what we’ll want to do is either treat isolated samples as ancestral, or remove them because they are missing (as with na.rm=TRUE in R). Either of the options above works, but I think option (a) better connotes what we will do if there is missing data? What about isolated_samples_missing, defaulting to True, and a C flag of TSK_ISOLATED_SAMPLES_NOT_MISSING?

Maybe something more explicit like TSK_ISOLATED_AS_ANCESTRAL?

@benjeffery, @petrelharp, @bhaller and I had a meeting to discuss this today. Here’s our conclusions:

  1. The current names for controlling the behaviour are confusing (TSK_IMPUTE_MISSING_DATA), so we need an alternative. I like @hyanwong’s idea of TSK_MISSING_AS_ANCESTRAL, but open to ideas here.
  2. An isolated node that has a mutation above it shouldn’t be considered missing. We should try to fix this if at all possible.
  3. Most of the reason that missing data is problematic currently in SLiM is because of the current behaviour of simplify, where we remove unary branches from the MRCA of samples back to more ancient nodes. We should change the default behaviour of simplify to keep these edges, because there’s important information in knowing how long back your simulation actually started.

I’ll take a look at the feasibility of 1 and 2 here and report back.

Any thoughts/objections?

Hi, folks - I want to make a plug for changing the name of impute_missing_data before the release - maybe it doesn’t matter, since we’ll have to keep the old name around anyhow? - but anyhow, here goes:

What “impute missing data” is doing is not really imputing at all, I’d argue. “Impute”, for genomes, usually means “fancy guessing based on other nearby genotypes”. What the option here is doing is just assigning the ancestral value. That is a natural guess for missing data, so that does count as “imputing”, but it’s still kinda surprising as at first guess people would assume that tskit would ‘impute’ using the trees somehow.

And, more seriously, there are other situations where isolated samples are not actually supposed to represent missing data. SLiM simulations, for instance. To get things to work out correctly for those, we need to set impute_missing_data=True, when really what we mean is isolated_samples_are_missing=False. So, we can certainly communicate that this is what everyone needs to do, but I anticipate a fair amount of confusion.

So: my proposal is to rename impute_missing_data=True to isolated_samples_are_missing=False or perhaps even just missing_data=False.

Maybe I hadn’t updated the tskit code since before 0.2?

Is there any problem with saying that it only counts as missing data if the nice is isolation and has no mutation above it? That way we would at least be able to represent that “this node isn’t related to anyone else, but has allele ‘A’”.