tskit: Node out of bounds when trying to simplify large tree sequence
The following model runs in SLiM (with a very long pause getting through the first generation, due to an unrelated bug):
initialize() {
    initializeSLiMModelType("nonWF");
    initializeTreeSeq(simplificationInterval=500);
    defineConstant("Np", 50);                    // number of subpopulations
    defineConstant("K", 1000000);                // total carrying capacity
    defineConstant("Kp", asInteger(K / Np));     // carrying capacity per subpop
    initializeMutationType("m1", 0.5, "f", 0.0);
    m1.convertToSubstitution = T;
    initializeGenomicElementType("g1", m1, 1.0);
    initializeGenomicElement(g1, 0, 600);
    initializeMutationRate(0);
    initializeRecombinationRate(0);
}
reproduction() {
    subpop.addCrossed(individual, subpop.sampleIndividuals(1));
}
1 early() {
    print(time());
    for (i in 1:Np)
        sim.addSubpop(i, Kp);
}
early() {
    // density-dependent regulation toward Kp in each subpopulation
    for (subpop in sim.subpopulations)
        subpop.fitnessScaling = Kp / subpop.individualCount;
}
10000 late() {
    print(time());
    sim.outputFixedMutations();
}
When it reaches generation 500, SLiM tries to simplify, and at that point an error is triggered: “tsk_table_collection_sort: Node out of bounds”. The relevant line in SLiM’s code is at slim_sim.cpp:4787, where we call tsk_table_collection_sort() in preparation for simplification. Stepping into that call, it appears that the “Node out of bounds” error is thrown by table_sorter_init().
It’s a large model, and of course there has to be some upper limit on how big the tree-sequence recording tables can get. The bug here, I think, is really a matter of policy: if a given amount of data can be recorded into the tables, you ought to be able to simplify down from it. In other words, if there is an upper limit on how much can be handled, that limit ought to be hit during recording, not at the point where you try to sort/simplify. Beyond being just a “gut feel”, this policy would also be better because when the sort/simplify fails for being too big, you have no idea how much earlier you ought to have done it. This model simplifies every 500 generations; given that that fails, what interval would avoid the failure? There’s no way to know. Whereas if the error occurred during recording, it would at least occur in a specific generation, and one could say “ah, it overflowed in generation 168; perhaps simplifying every 150 generations would be a good idea”.
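In tskit’s Python terms, the policy I’m arguing for amounts to something like the following sketch. This is a hypothetical wrapper, purely illustrative; the real check would belong in the C library’s row-adding functions, and the exact usable maximum below is an assumption.

import tskit

# tsk_id_t is a signed 32-bit integer in the tskit C library, so row ids
# beyond ~2**31 cannot be represented; the exact limit used here is an
# assumption for illustration.
MAX_ROWS = 2**31 - 2

def add_node_checked(tables: tskit.TableCollection, generation: int, **kwargs) -> int:
    # Hypothetical helper: fail at recording time, in a known generation,
    # rather than later inside sort/simplify.
    if tables.nodes.num_rows >= MAX_ROWS:
        raise OverflowError(
            "node table overflowed in generation %d; simplify more often"
            % generation)
    return tables.nodes.add_row(**kwargs)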
But I suspect we also really just want to move to 64-bit table indices and ids in tskit, removing these overflow issues altogether, since we plan to be running on cluster nodes with very large amounts of memory.
I’ve changed the title of #343.
Perhaps I will close this, and change the title of #343 so it’s not just about ragged arrays?
I think we will have to do 64-bit tables in tskit in the near future, but there are too many other things happening right now for us to take it on. We should discuss this over in #343. I’ve tagged 64-bit tables as a C API 1.0 issue.
And, wow, @bhaller too with the lightning-quick GitHub response!
Note that @molpopgen’s model is not parallel to the SLiM model, I think, because the SLiM model makes 50 subpops of size 20,000 each (1,000,000 individuals in total), whereas his model makes only one; so it doesn’t chew through the table space required to cause the overflow, I think, unless I’m missing something.
Gee, you noticed this quick. =) I’m just typing up the rationale…
Just to keep this one alive, I am unable to reproduce an overflow via a WF script written entirely in Python:
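(The original script isn’t reproduced here; below is a minimal sketch of the kind of pure-Python recording loop described: a haploid WF model with no recombination, built on tskit’s TableCollection API, at toy scale. The full-scale run would correspond to something like N = 2,000,000 over 500 generations, i.e. N × 500 = 1e9 child nodes plus 2e6 founders, with one edge per child.)

import random
import tskit

N = 2000      # population size (toy value; ~2e6 for the full-scale run)
GENS = 500    # number of generations to simulate
L = 1.0       # genome length; irrelevant here since there is no recombination

tables = tskit.TableCollection(sequence_length=L)

# Founder generation; tskit time increases into the past, so founders
# get the largest time value.
parents = [tables.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=GENS)
           for _ in range(N)]

for gen in range(GENS - 1, -1, -1):
    children = []
    for _ in range(N):
        child = tables.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=gen)
        # No recombination: each child inherits its whole genome from one
        # randomly chosen parent, so there is exactly one edge per child.
        tables.edges.add_row(left=0, right=L, parent=random.choice(parents),
                             child=child)
        children.append(child)
    parents = children

# The same sort that fails in the SLiM model above, then simplification.
tables.sort()
tables.simplify(samples=parents)
print(tables.nodes.num_rows, tables.edges.num_rows)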
At full scale, the script generated 1e9 + 2e6 nodes and 1e9 edges, and ran to completion against tskit 0.2.2 on Ubuntu 18.04 LTS with Python 3.6.8.
For the record, it took about 13 minutes and 105 GB of RAM.
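That footprint is roughly what the raw table sizes predict: an edge row is two doubles plus two 32-bit ids (24 bytes), and a node row is about the same size, so 1e9 of each comes to roughly 50 GB of table data, before counting the amortized doubling of table buffers during growth and the working space needed by sort/simplify.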
Short answer is, yes, we probably need them, but it’ll be a while before we do it. (But we’ll have a full-time engineer on the case pretty soon.)