tskit: Valid TS should not allow nodes from multiple populations within the same individual
At the moment, it appears as if it is valid to create nodes within the same individual that belong to different populations (code below). I think we should ban this, and raise an error when tables with this property are turned into a tree sequence. Does this seem sensible? I don’t think that having a single individual containing two sample nodes each belong to a different population makes much sense.
ts = msprime.simulate(4, random_seed=1)
tables = ts.tables
tables.individuals.add_row()
tables.individuals.add_row()
ind = tables.nodes.individual
pop = tables.nodes.population
ind[ts.samples()] = np.arange(ts.num_samples)//2
tables.nodes.individual = ind
pop = np.full(ts.num_nodes, tskit.NULL, dtype=tables.nodes.population.dtype)
pop[0] = 0 # only sample 0 has a population, but
tables.nodes.population = pop
new_ts = tables.tree_sequence()
print(str(new_ts.tables.nodes))
id flags population individual time metadata
0 1 0 0 0.00000000000000
1 1 -1 0 0.00000000000000
2 1 -1 1 0.00000000000000
3 1 -1 1 0.00000000000000
4 0 -1 -1 0.17986860883729
5 0 -1 -1 1.97752050080185
6 0 -1 -1 2.69754599651364
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 20 (20 by maintainers)
Closing this due to inactivity - unless we have a compelling reason to change the semantics here, I don’t think we should.
This basically is redundancy in the data model @hyanwong, but redundancy isn’t always a bad thing (in my old job as a storage engineer redundancy is something to strive for!). My take is that either we embrace this redundancy, seeing that it might come in useful at some point or we get rid of the redundancy entirely and move population and time refs etc into the individual. I don’t like the halfway house you’re proposing, where we’re formally acknowledging that the data model is broken. I just don’t see what would be gained.
Having an optional check as @petrelharp suggests sounds like a great idea, though, as producers of tree sequences could check this as a sanity check.
I dunno - if we break anyone’s code because of this, isn’t it certainly a bug in their code? The adjacency of nodes from the same individual is another story, though.
Thanks @petrelharp . I’m just putting together the PR to check in the C routine
tsk_table_collection_check_node_integrityintables.c, which looks like the right place to me.