tskit: Valid TS should not allow nodes from multiple populations within the same individual

At the moment, it appears as if it is valid to create nodes within the same individual that belong to different populations (code below). I think we should ban this, and raise an error when tables with this property are turned into a tree sequence. Does this seem sensible? I don’t think that having a single individual containing two sample nodes each belong to a different population makes much sense.

ts = msprime.simulate(4, random_seed=1)
tables = ts.tables
tables.individuals.add_row()
tables.individuals.add_row()
ind = tables.nodes.individual
pop = tables.nodes.population
ind[ts.samples()] = np.arange(ts.num_samples)//2
tables.nodes.individual = ind
pop = np.full(ts.num_nodes, tskit.NULL, dtype=tables.nodes.population.dtype)
pop[0] = 0  # only sample 0 has a population, but 
tables.nodes.population = pop
new_ts = tables.tree_sequence()
print(str(new_ts.tables.nodes))

id	flags	population	individual	time	metadata
0	1	0	0	0.00000000000000	
1	1	-1	0	0.00000000000000	
2	1	-1	1	0.00000000000000	
3	1	-1	1	0.00000000000000	
4	0	-1	-1	0.17986860883729	
5	0	-1	-1	1.97752050080185	
6	0	-1	-1	2.69754599651364	

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 20 (20 by maintainers)

Most upvoted comments

Closing this due to inactivity - unless we have a compelling reason to change the semantics here, I don’t think we should.

This basically is redundancy in the data model @hyanwong, but redundancy isn’t always a bad thing (in my old job as a storage engineer redundancy is something to strive for!). My take is that either we embrace this redundancy, seeing that it might come in useful at some point or we get rid of the redundancy entirely and move population and time refs etc into the individual. I don’t like the halfway house you’re proposing, where we’re formally acknowledging that the data model is broken. I just don’t see what would be gained.

Having an optional check as @petrelharp suggests sounds like a great idea, though, as producers of tree sequences could check this as a sanity check.

there’s a very good chance we’ll break someone’s code if we start doing so now.

I dunno - if we break anyone’s code because of this, isn’t it certainly a bug in their code? The adjacency of nodes from the same individual is another story, though.

Thanks @petrelharp . I’m just putting together the PR to check in the C routine tsk_table_collection_check_node_integrity in tables.c, which looks like the right place to me.