go-spacemesh: Devnet hare failures

We fired up two devnets using spacecraft over the last 2-3 weeks and both experienced consensus failures. The first (devnet1, a.k.a. the lane network in spacecraft) failed after ~5000 layers, while the second (devnet3) failed after only 18 layers.

It seems to be down to a hare failure. I see this error for every layer:

2020-08-11T17:02:10.941Z        ERROR   f068b.hare              Fatal: PreRound ended with empty set    {"node_id": "f068bb7391359c002bf3621f9307284cf72766f63c0212ad135a66fa18bf4f89", "event": true, "layer_id": 357}
2020-08-11T17:05:30.941Z        ERROR   f068b.hare              Fatal: PreRound ended with empty set    {"node_id": "f068bb7391359c002bf3621f9307284cf72766f63c0212ad135a66fa18bf4f89", "event": true, "layer_id": 358}
2020-08-11T17:08:50.941Z        ERROR   f068b.hare              Fatal: PreRound ended with empty set    {"node_id": "f068bb7391359c002bf3621f9307284cf72766f63c0212ad135a66fa18bf4f89", "event": true, "layer_id": 359}
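
For context on what this error means: in the Hare pre-round, each participant broadcasts the set of values it proposes for the layer, and a node keeps only the values that gather enough support from the pre-round messages it received in time. If messages are missing or arrive late, nothing clears the threshold and the protocol aborts for that layer. The sketch below is a minimal illustration of that idea; the types, the threshold logic, and the function names are my own assumptions, not the actual go-spacemesh implementation.

```go
package main

import "fmt"

// BlockID stands in for the value type voted on in the pre-round (hypothetical).
type BlockID string

// preRound tallies the sets received in pre-round messages and keeps only
// values supported by at least `threshold` participants. If no value clears
// the threshold (e.g. because messages arrived late or not at all), the
// resulting set is empty and consensus for this layer must be aborted.
func preRound(messages [][]BlockID, threshold int) ([]BlockID, error) {
	support := make(map[BlockID]int)
	for _, set := range messages {
		for _, id := range set {
			support[id]++
		}
	}

	var out []BlockID
	for id, n := range support {
		if n >= threshold {
			out = append(out, id)
		}
	}

	if len(out) == 0 {
		return nil, fmt.Errorf("PreRound ended with empty set")
	}
	return out, nil
}

func main() {
	// Only one of the expected messages arrived in time, so nothing reaches
	// the threshold and the pre-round fails, as in the logs above.
	msgs := [][]BlockID{{"a", "b"}}
	if _, err := preRound(msgs, 2); err != nil {
		fmt.Println("hare failure:", err)
	}
}
```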

Full logs and config files for both networks are in this shared folder: https://drive.google.com/drive/folders/1UT4rMIaOd0vSGovwdhsAfzCrDNd1fEnL?usp=sharing

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

"Note that, at this point, the CPU usage is still fine. It doesn't spike until ~11:35, or nearly 20 mins later."

  • This is not accurate. Remember that the CPU graph is a per-minute average. It starts jumping around at ~11:12, which means it's possible that CPU usage is already at 100% whenever certain goroutines are scheduled, but they're only scheduled ~60% of the time; by 11:35 they're running constantly and the per-minute average reaches ~100% (see the sketch after this list).
  • Remember that you're not looking at the CPU usage of a representative node, but at the usage of all nodes combined. It's possible that certain nodes are using plenty of CPU while others are starved; the starved nodes then send their messages late, which in turn creates more CPU load.
  • It’s very likely, imo, that this is a snowball effect started by insufficient resources during some random spike in CPU usage.
  • While it could be interesting to investigate this snowball effect and see how we can break out of it on low-resource machines, this is not a priority right now and we shouldn't waste time on it.
  • I’m fairly confident that if you run fewer miners per machine you will not see this situation happening again. I’d start with 25 per machine, on two machines.
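
To make the per-minute averaging point concrete, here is a tiny sketch (the numbers are made up for illustration, not taken from the graphs) showing how a process that pins the CPU at 100% for only part of each minute shows up on the graph as a lower average rather than a spike:

```go
package main

import "fmt"

// perMinuteAverage returns what a per-minute CPU graph would show for a
// process that runs at 100% for busyFraction of the minute and is idle
// otherwise. Bursty saturation is smoothed into a modest-looking average.
func perMinuteAverage(busyFraction float64) float64 {
	const busyCPU, idleCPU = 100.0, 0.0
	return busyFraction*busyCPU + (1-busyFraction)*idleCPU
}

func main() {
	for _, f := range []float64{0.6, 0.8, 1.0} {
		fmt.Printf("busy %.0f%% of the minute -> graph shows ~%.0f%% CPU\n",
			f*100, perMinuteAverage(f))
	}
}
```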