tskit: Node out of bounds when trying to simplify large tree sequence
The following model runs in SLiM (with a very long pause getting through the first generation, due to an unrelated bug):
initialize() {
    initializeSLiMModelType("nonWF");
    initializeTreeSeq(simplificationInterval=500);
    defineConstant("Np", 50);                    // number of subpopulations
    defineConstant("K", 1000000);                // total carrying capacity
    defineConstant("Kp", asInteger(K / Np));     // carrying capacity per subpop
    initializeMutationType("m1", 0.5, "f", 0.0);
    m1.convertToSubstitution = T;
    initializeGenomicElementType("g1", m1, 1.0);
    initializeGenomicElement(g1, 0, 600);
    initializeMutationRate(0);
    initializeRecombinationRate(0);
}
reproduction() {
    subpop.addCrossed(individual, subpop.sampleIndividuals(1));
}
1 early() {
    print(time());
    for (i in 1:Np)
        sim.addSubpop(i, Kp);
}
early() {
    // density-dependent regulation toward Kp in each subpopulation
    for (subpop in sim.subpopulations)
        subpop.fitnessScaling = Kp / subpop.individualCount;
}
10000 late() {
    print(time());
    sim.outputFixedMutations();
}
When it reaches generation 500, SLiM tries to simplify, and at that point an error is triggered: “tsk_table_collection_sort: Node out of bounds”. The relevant line in SLiM’s code is at slim_sim.cpp:4787, where we call tsk_table_collection_sort() in preparation for simplification. Stepping into that call, it appears that the “Node out of bounds” error is thrown by table_sorter_init().
It’s a large model, and of course there has to be some upper limit on how big the tree-sequence recording tables can get. The bug here, I think, is really a matter of policy: if a given amount of data can be recorded into the tables, you ought to be able to simplify down from it. In other words, if there is an upper limit on how much can be handled, that limit ought to be hit during recording, not at the point where you try to sort/simplify. Beyond being just a “gut feel”, this policy would also be better because when the sort/simplify fails for being too big, you have no idea how much earlier you ought to have done it. This model simplifies every 500 generations; given that that fails, what interval would avoid the failure? There’s no way to know. Whereas if the error occurred during recording, it would at least occur in a specific generation, and one could say “ah, it overflowed in generation 168; perhaps simplifying every 150 generations would be a good idea”.
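In tskit’s Python terms, the policy I’m arguing for amounts to something like the following sketch. This is a hypothetical wrapper, purely illustrative; the real check would belong in the C library’s row-adding functions, and the exact usable maximum below is an assumption.

import tskit

# tsk_id_t is a signed 32-bit integer in the tskit C library, so row ids
# beyond ~2**31 cannot be represented; the exact limit used here is an
# assumption for illustration.
MAX_ROWS = 2**31 - 2

def add_node_checked(tables: tskit.TableCollection, generation: int, **kwargs) -> int:
    # Hypothetical helper: fail at recording time, in a known generation,
    # rather than later inside sort/simplify.
    if tables.nodes.num_rows >= MAX_ROWS:
        raise OverflowError(
            "node table overflowed in generation %d; simplify more often"
            % generation)
    return tables.nodes.add_row(**kwargs)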
But I suspect we also really just want to move to 64-bit table indices and ids in tskit, removing these overflow issues altogether, since we plan to be running on cluster nodes with very large amounts of memory.
I’ve changed the title of #343.
Perhaps I will close this, and change the title of #343 so it’s not just about ragged arrays?
I think we will have to do 64-bit tables in tskit in the near future, but there are too many other things happening right now for us to take it on. We should discuss this over in #343. I’ve tagged 64-bit tables as a C API 1.0 issue.
And, wow, @bhaller too with the lightning-quick GitHub response!
Note that @molpopgen’s model is not parallel to the SLiM model, I think, because the SLiM model makes 50 subpops of size 20,000 each (1,000,000 individuals in total), whereas his model makes only one; so it doesn’t chew through the table space required to cause the overflow, I think, unless I’m missing something.
Gee, you noticed this quick. =) I’m just typing up the rationale…
Just to keep this one alive, I am unable to reproduce an overflow via a WF script written entirely in Python:
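(The original script isn’t reproduced here; below is a minimal sketch of the kind of pure-Python recording loop described: a haploid WF model with no recombination, built on tskit’s TableCollection API, at toy scale. The full-scale run would correspond to something like N = 2,000,000 over 500 generations, i.e. N × 500 = 1e9 child nodes plus 2e6 founders, with one edge per child.)

import random
import tskit

N = 2000      # population size (toy value; ~2e6 for the full-scale run)
GENS = 500    # number of generations to simulate
L = 1.0       # genome length; irrelevant here since there is no recombination

tables = tskit.TableCollection(sequence_length=L)

# Founder generation; tskit time increases into the past, so founders
# get the largest time value.
parents = [tables.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=GENS)
           for _ in range(N)]

for gen in range(GENS - 1, -1, -1):
    children = []
    for _ in range(N):
        child = tables.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=gen)
        # No recombination: each child inherits its whole genome from one
        # randomly chosen parent, so there is exactly one edge per child.
        tables.edges.add_row(left=0, right=L, parent=random.choice(parents),
                             child=child)
        children.append(child)
    parents = children

# The same sort that fails in the SLiM model above, then simplification.
tables.sort()
tables.simplify(samples=parents)
print(tables.nodes.num_rows, tables.edges.num_rows)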
At full scale, the script generated 1e9 + 2e6 nodes and 1e9 edges, and ran to completion against tskit 0.2.2 on Ubuntu 18.04 LTS with Python 3.6.8.
For the record, it took about 13 minutes and 105 GB of RAM.
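That footprint is roughly what the raw table sizes predict: an edge row is two doubles plus two 32-bit ids (24 bytes), and a node row is about the same size, so 1e9 of each comes to roughly 50 GB of table data, before counting the amortized doubling of table buffers during growth and the working space needed by sort/simplify.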
Short answer is, yes, we probably need them, but it’ll be a while before we do it. (But we’ll have a full-time engineer on the case pretty soon.)