tskit: Some questions about missing data
Sorry to bring this up again, but #714 made @bhaller ask some good questions. Two observations:
- It is surprising that
simplify()can make a formerly non-missing genotype become missing. - An isolated node is considered to have a missing genotype even if it has a mutation above it, which is also surprising.
The first thing is defensible, because we treat the state of every root as the “ancestral state” even though there’s got to be some element of unknown-ness. But it is certainly surprising, and makes me wonder if we can modify this behavior. The easiest way I could think of to modify it would be to *have some way to mark an isolated node as “not actually missing”. *
The second thing does not feel right.
Here’s a demonstration of (2), btw:
tables = tskit.TableCollection(sequence_length=1.0)
tables.nodes.add_row(time=0, flags=1)
tables.nodes.add_row(time=0, flags=1)
tables.nodes.add_row(time=0, flags=1)
tables.nodes.add_row(time=1, flags=0)
tables.edges.add_row(parent=3, child=1, left=0, right=1)
tables.edges.add_row(parent=3, child=2, left=0, right=1)
tables.sites.add_row(position=0.5, ancestral_state='x')
tables.mutations.add_row(site=0, node=0, derived_state='y')
ts = tables.tree_sequence()
ts.genotype_matrix()
# array([[-1, 0, 0]], dtype=int8)
I don’t remember discussion of this point when working out missing-ness, but maybe I’m forgetting it?
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 35 (29 by maintainers)
I’m with @jeromekelleher, and
ISOLATED_NOT_MISSINGon the C side.I’m coming around to
isolated_as_missing=True- this seems to convey the most information, and since people will need to use the option very rarely (missing data will be handled correctly, by default), it’s OK that it’s a little on the long side.Hi all, 0.3.0 is very close now thanks to #782. The name changes discussed above are the only remaining item.
isolated_are_missingseems to have most support if I’ve read the thread correctly. On the C side there is less consensus, butTSK_ISOLATED_NOT_MISSINGseems to mirror the python name.Unless there is any issue shall we go with these?
Gee,
missing_datadoesn’t sound too cryptic to me?For the stats,
remove_missingwould be more descriptive, but that doesn’t work for e.g..variants(). So, I’d vote formissing_data… and maybeTSK_NO_MISSING_DATA? What do you all think?Thinking this through… for the name of the flag/option, the two options are essentially either (a) “do we intend isolated samples to be missing data” or (b) “let isolated samples take the ancestral state”. The first is more general, so we could re-use the name of the option in more other methods. But the second is more precise, and says what the option is actually making the method do.
Thinking ahead to statistics: what we’ll want to do is either treat isolated samples as ancestral, or remove them because they are missing (as with
na.rm=TRUEin R). Either of the options above works, but I think option (a) better connotes what we will do if there is missing data? What aboutisolated_samples_missing, defaulting toTrue, and a C flag of TSK_ISOLATED_SAMPLES_NOT_MISSING?Maybe something more explicit like
TSK_ISOLATED_AS_ANCESTRAL?@benjeffery, @petrelharp, @bhaller and I had a meeting to discuss this today. Here’s our conclusions:
I’ll take a look at the feasibility of 1 and 2 here and report back.
Any thoughts/objections?
Hi, folks - I want to make a plug for changing the name of
impute_missing_databefore the release - maybe it doesn’t matter, since we’ll have to keep the old name around anyhow? - but anyhow, here goes:What “impute missing data” is doing is not really imputing at all, I’d argue. “Impute”, for genomes, usually means “fancy guessing based on other nearby genotypes”. What the option here is doing is just assigning the ancestral value. That is a natural guess for missing data, so that does count as “imputing”, but it’s still kinda surprising as at first guess people would assume that tskit would ‘impute’ using the trees somehow.
And, more seriously, there are other situations where isolated samples are not actually supposed to represent missing data. SLiM simulations, for instance. To get things to work out correctly for those, we need to set
impute_missing_data=True, when really what we mean isisolated_samples_are_missing=False. So, we can certainly communicate that this is what everyone needs to do, but I anticipate a fair amount of confusion.So: my proposal is to rename
impute_missing_data=Truetoisolated_samples_are_missing=Falseor perhaps even justmissing_data=False.Maybe I hadn’t updated the tskit code since before 0.2?
Is there any problem with saying that it only counts as missing data if the nice is isolation and has no mutation above it? That way we would at least be able to represent that “this node isn’t related to anyone else, but has allele ‘A’”.