zebra: Intermittent long delays to inbound peer connection handshakes, Credit: Ziggurat Team
Motivation
The Ziggurat team have reported some long delays (25-40 seconds) in Zebra’s inbound connection handler.
The most likely cause is running blocking code inside async futures on the same thread or task as the connection handler.
The amount of code between accepting an inbound connection and sending a version message is quite small. It’s possible that the async mutex on the nonce set or the read of the state tip from the watch channel is causing the delays, but that’s unlikely.
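As a hedged, self-contained illustration of that failure mode (a sketch, not Zebra’s actual handler code): a synchronous call such as std::thread::sleep, or a long CPU-bound loop, inside an async task parks the worker thread it runs on, which stalls every other task scheduled there, including an accept loop. Moving that work onto tokio::task::spawn_blocking keeps the async workers free.

```rust
use std::time::{Duration, Instant};

#[tokio::main(flavor = "current_thread")]
async fn main() {
    // Stand-in for an accept loop: it measures how late each iteration runs
    // compared to its 100ms schedule.
    let accept_loop = tokio::spawn(async {
        for _ in 0..15 {
            let start = Instant::now();
            tokio::time::sleep(Duration::from_millis(100)).await;
            let lag = start.elapsed().saturating_sub(Duration::from_millis(100));
            if lag > Duration::from_millis(10) {
                println!("accept loop delayed by {lag:?}");
            }
        }
    });

    // BAD: a synchronous sleep (or any long blocking work) inside an async
    // task parks the runtime thread, so the accept loop above can't run.
    let _bad = tokio::spawn(async {
        std::thread::sleep(Duration::from_secs(1));
    });

    // BETTER: blocking work belongs on the dedicated blocking thread pool.
    let _good = tokio::task::spawn_blocking(|| {
        std::thread::sleep(Duration::from_secs(1));
    });

    accept_loop.await.unwrap();
}
```

On a single-threaded runtime the delayed iteration shows up immediately; on a multi-threaded runtime the same stall only hits tasks that happen to share the blocked worker, which would make the delays intermittent rather than constant.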
Next Diagnostic Steps
- Run Zebra under strace or dtrace to work out what it’s waiting on during those 40 seconds.
- Run Zebra with tokio-console to see what the blocking task is (a setup sketch follows this list).
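A minimal sketch of how tokio-console instrumentation is typically enabled, assuming the console-subscriber crate is added as a dependency (Zebra wires up its own tracing subscribers, so this is illustrative rather than a description of Zebra’s code):

```rust
// Assumptions: `console-subscriber` in Cargo.toml, and the binary built with
// RUSTFLAGS="--cfg tokio_unstable" so tokio emits task instrumentation.

#[tokio::main]
async fn main() {
    // Installs the console-subscriber tracing layer and starts the gRPC
    // endpoint that the `tokio-console` CLI attaches to (127.0.0.1:6669 by
    // default).
    console_subscriber::init();

    // ... run the code under investigation here; tasks that block or are
    // polled for a long time then stand out in the console's task list ...
    tokio::time::sleep(std::time::Duration::from_secs(60)).await;
}
```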
Complex Code or Requirements
This seems like a concurrency bug.
Testing
Manually start up Zebra and check how long it takes to accept inbound connections.
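One way to script that manual check, as a sketch: open a TCP connection to a local Zebra instance and time how long the first bytes take to arrive. This assumes the default mainnet port 8233 and relies on the behaviour described above, namely that Zebra sends its version message shortly after accepting an inbound connection.

```rust
use std::time::Instant;
use tokio::{io::AsyncReadExt, net::TcpStream};

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let start = Instant::now();

    // Assumed local Zebra listener; adjust the address for your configuration.
    let mut stream = TcpStream::connect("127.0.0.1:8233").await?;
    println!("TCP connect took {:?}", start.elapsed());

    // Wait for the first bytes of Zebra's version message. A multi-second gap
    // here matches the 25-40 second delays reported in this issue.
    let mut buf = [0u8; 256];
    let n = stream.read(&mut buf).await?;
    println!("first {n} bytes arrived after {:?}", start.elapsed());

    Ok(())
}
```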
Related Work
This issue is possibly related to:
Discussions of this bug
- https://forum.zcashcommunity.com/t/ziggurat-3-0/43350/31
- https://discord.com/channels/809218587167293450/809251029673312267/1111072914624757770
- private communication to avoid leaking user IP addresses
About this issue
- State: closed
- Created a year ago
- Comments: 22 (22 by maintainers)
Now I’m seeing some inbound handshake timeouts, but in the outer timeout wrapper.
This means that a handshake task took 500ms longer than the timeout inside that task. That should never happen as long as the cooperative task scheduler is operating efficiently, so the scheduler or the inbound task’s thread is regularly getting blocked for 500ms or more.
There’s about one timeout every few hours, which could explain the hangs in this ticket, because a connection only hits that timeout if it’s being made while a hang is in progress. It’s also possible that the hang rate increases as the inbound connection load goes up.
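A sketch of the nested-timeout pattern being described, with illustrative values rather than Zebra’s actual ones: the inner timeout should always fire before the outer wrapper, which only adds a 500ms margin, so the outer timeout firing at all means the spawned task wasn’t polled to completion within that margin.

```rust
use std::time::Duration;
use tokio::time::timeout;

// Hypothetical handshake future, used only to make the sketch runnable.
async fn handshake() {
    tokio::time::sleep(Duration::from_secs(10)).await;
}

#[tokio::main]
async fn main() {
    let inner = Duration::from_secs(3);
    // The outer wrapper only allows a 500ms margin on top of the inner timeout.
    let outer = inner + Duration::from_millis(500);

    let task = tokio::spawn(timeout(inner, handshake()));

    match timeout(outer, task).await {
        // Expected cases: the handshake finished or its own timeout fired.
        Ok(Ok(Ok(()))) => println!("handshake finished"),
        Ok(Ok(Err(_))) => println!("inner handshake timeout"),
        Ok(Err(join_err)) => println!("handshake task panicked or was cancelled: {join_err}"),
        // The outer timeout firing means the task holding the inner timeout
        // wasn't polled for at least the 500ms margin, i.e. the scheduler or
        // its thread was blocked.
        Err(_) => println!("outer timeout: executor or thread was blocked"),
    }
}
```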
Here are my logs; they’re probably not that useful by themselves, so I’ll go find context later:
The next step is to find the source of the scheduler or thread hangs. It might be enough to look at the log context; I’ll try that when I get time.
We could also use tokio-console, dtrace, or a similar profiling tool. Ideally we’d activate or inspect it slightly before the hang, then see what Zebra is doing as the hang starts and ends.
After testing PR #6969 on a full sync and a synced instance for a day, I saw no inbound service timeouts, and one inbound handshake timeout:
This is acceptable and doesn’t need any more changes; it looks like the bug is elsewhere.
It was stuck just behind that block at a checkpoint:
The logs are all available on the Google Cloud instance.
I would look at the checkpoint verifier or its downloader. We might need to add more info to the “stuck” log, or add a log for stuck checkpoints.
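As a sketch of what a more informative “stuck” log could carry, using the tracing crate (the function and field names here are placeholders, not Zebra’s actual checkpoint verifier types):

```rust
use std::time::Duration;

/// Placeholder fields standing in for the checkpoint verifier's real state.
fn log_stuck_checkpoint(
    target_checkpoint_height: u32,
    highest_verified_height: u32,
    pending_blocks: usize,
    stuck_for: Duration,
) {
    // Structured fields make it possible to see *which* checkpoint is stuck
    // and for how long, rather than just that the sync has stalled.
    tracing::warn!(
        target_checkpoint_height,
        highest_verified_height,
        pending_blocks,
        ?stuck_for,
        "checkpoint verification appears stuck"
    );
}

fn main() {
    // Requires the `tracing` and `tracing-subscriber` crates; the values below
    // are illustrative only.
    tracing_subscriber::fmt::init();
    log_stuck_checkpoint(1_000_000, 999_600, 400, Duration::from_secs(300));
}
```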
The repeated timeouts are tracked by bug #5709.
I may have copied the wrong log line, but you’re right that they’re a thousand blocks apart.
The block write task returns a different error when it drops a block. So it’s either a download or verification timeout. (Or maybe a verifier hang due to unsatisfied verification constraints.)
Both block hashes are valid, so neither of those errors should have happened:
- https://blockexplorer.one/zcash/testnet/blockHash/000d1acb9eb88a30b1b79d97dc37562f31242ca67b9639549b2ee011fd53515d
- https://blockexplorer.one/zcash/testnet/blockHash/000396228eeb43850b30b2441ca9391fb78dc622d6dd5649db1913efbe7dada0
The Ziggurat team have done some more testing:
This suggests that either the inbound service is blocked, or the entire async thread or executor is blocked.
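One hedged way to tell those two cases apart (a sketch of a generic technique, not something the issue says Zebra does): run a lightweight heartbeat task on the same runtime and log whenever a tick arrives much later than scheduled. If the heartbeat also stalls during a hang, the whole executor or its thread is blocked; if it stays on time while handshakes stall, the blockage is inside the inbound service itself.

```rust
use std::time::{Duration, Instant};

/// Logs whenever the executor fails to poll this task close to on time.
async fn executor_heartbeat(period: Duration, max_lag: Duration) {
    let mut interval = tokio::time::interval(period);
    loop {
        let before = Instant::now();
        interval.tick().await;
        let lag = before.elapsed().saturating_sub(period);
        if lag > max_lag {
            eprintln!("executor heartbeat lagged by {lag:?}");
        }
    }
}

#[tokio::main]
async fn main() {
    tokio::spawn(executor_heartbeat(
        Duration::from_millis(100),
        Duration::from_millis(50),
    ));

    // ... run the inbound connection handling here; heartbeat lag messages
    // during a handshake hang point at the executor, while their absence
    // points at the inbound service ...
    tokio::time::sleep(Duration::from_secs(5)).await;
}
```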