zebra: Intermittent long delays to inbound peer connection handshakes, Credit: Ziggurat Team
Motivation
The Ziggurat team have reported some long delays (25-40 seconds) in Zebra’s inbound connection handler.
The most likely cause is running blocking code inside async futures on the same thread or task as the connection handler.
The amount of code between accepting an inbound connection and sending a version message is quite small. It’s possible that the async mutex on the nonce set or the read of the state tip from the watch channel is causing the delays, but that’s unlikely.
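As a hedged, self-contained illustration of that failure mode (a sketch, not Zebra’s actual handler code): a synchronous call such as std::thread::sleep, or a long CPU-bound loop, inside an async task parks the worker thread it runs on, which stalls every other task scheduled there, including an accept loop. Moving that work onto tokio::task::spawn_blocking keeps the async workers free.

```rust
use std::time::{Duration, Instant};

#[tokio::main(flavor = "current_thread")]
async fn main() {
    // Stand-in for an accept loop: it measures how late each iteration runs
    // compared to its 100ms schedule.
    let accept_loop = tokio::spawn(async {
        for _ in 0..15 {
            let start = Instant::now();
            tokio::time::sleep(Duration::from_millis(100)).await;
            let lag = start.elapsed().saturating_sub(Duration::from_millis(100));
            if lag > Duration::from_millis(10) {
                println!("accept loop delayed by {lag:?}");
            }
        }
    });

    // BAD: a synchronous sleep (or any long blocking work) inside an async
    // task parks the runtime thread, so the accept loop above can't run.
    let _bad = tokio::spawn(async {
        std::thread::sleep(Duration::from_secs(1));
    });

    // BETTER: blocking work belongs on the dedicated blocking thread pool.
    let _good = tokio::task::spawn_blocking(|| {
        std::thread::sleep(Duration::from_secs(1));
    });

    accept_loop.await.unwrap();
}
```

On a single-threaded runtime the delayed iteration shows up immediately; on a multi-threaded runtime the same stall only hits tasks that happen to share the blocked worker, which would make the delays intermittent rather than constant.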
Next Diagnostic Steps
- Run Zebra under strace or dtrace to work out what it’s waiting on during those 40 seconds.
- Run Zebra with tokio-console to see what the blocking task is (a setup sketch follows this list).
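A minimal sketch of how tokio-console instrumentation is typically enabled, assuming the console-subscriber crate is added as a dependency (Zebra wires up its own tracing subscribers, so this is illustrative rather than a description of Zebra’s code):

```rust
// Assumptions: `console-subscriber` in Cargo.toml, and the binary built with
// RUSTFLAGS="--cfg tokio_unstable" so tokio emits task instrumentation.

#[tokio::main]
async fn main() {
    // Installs the console-subscriber tracing layer and starts the gRPC
    // endpoint that the `tokio-console` CLI attaches to (127.0.0.1:6669 by
    // default).
    console_subscriber::init();

    // ... run the code under investigation here; tasks that block or are
    // polled for a long time then stand out in the console's task list ...
    tokio::time::sleep(std::time::Duration::from_secs(60)).await;
}
```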
Complex Code or Requirements
This seems like a concurrency bug.
Testing
Manually start up Zebra and check how long it takes to accept inbound connections.
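One way to script that manual check, as a sketch: open a TCP connection to a local Zebra instance and time how long the first bytes take to arrive. This assumes the default mainnet port 8233 and relies on the behaviour described above, namely that Zebra sends its version message shortly after accepting an inbound connection.

```rust
use std::time::Instant;
use tokio::{io::AsyncReadExt, net::TcpStream};

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let start = Instant::now();

    // Assumed local Zebra listener; adjust the address for your configuration.
    let mut stream = TcpStream::connect("127.0.0.1:8233").await?;
    println!("TCP connect took {:?}", start.elapsed());

    // Wait for the first bytes of Zebra's version message. A multi-second gap
    // here matches the 25-40 second delays reported in this issue.
    let mut buf = [0u8; 256];
    let n = stream.read(&mut buf).await?;
    println!("first {n} bytes arrived after {:?}", start.elapsed());

    Ok(())
}
```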
Related Work
This issue is possibly related to:
Discussions of this bug
- https://forum.zcashcommunity.com/t/ziggurat-3-0/43350/31
- https://discord.com/channels/809218587167293450/809251029673312267/1111072914624757770
- private communication to avoid leaking user IP addresses
About this issue
- State: closed
- Created a year ago
- Comments: 22 (22 by maintainers)
Now I’m seeing some inbound handshake timeouts, but in the outer timeout wrapper.
This means that a handshake task took 500ms longer than the timeout inside that task. That should never happen as long as the cooperative task scheduler is operating efficiently, so the scheduler or the inbound task’s thread is regularly getting blocked for 500ms or more.
There’s about one timeout every few hours, which could explain the hangs in this ticket, because a connection only hits that timeout if it’s being made while a hang is in progress. It’s also possible that the hang rate increases as the inbound connection load goes up.
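A sketch of the nested-timeout pattern being described, with illustrative values rather than Zebra’s actual ones: the inner timeout should always fire before the outer wrapper, which only adds a 500ms margin, so the outer timeout firing at all means the spawned task wasn’t polled to completion within that margin.

```rust
use std::time::Duration;
use tokio::time::timeout;

// Hypothetical handshake future, used only to make the sketch runnable.
async fn handshake() {
    tokio::time::sleep(Duration::from_secs(10)).await;
}

#[tokio::main]
async fn main() {
    let inner = Duration::from_secs(3);
    // The outer wrapper only allows a 500ms margin on top of the inner timeout.
    let outer = inner + Duration::from_millis(500);

    let task = tokio::spawn(timeout(inner, handshake()));

    match timeout(outer, task).await {
        // Expected cases: the handshake finished or its own timeout fired.
        Ok(Ok(Ok(()))) => println!("handshake finished"),
        Ok(Ok(Err(_))) => println!("inner handshake timeout"),
        Ok(Err(join_err)) => println!("handshake task panicked or was cancelled: {join_err}"),
        // The outer timeout firing means the task holding the inner timeout
        // wasn't polled for at least the 500ms margin, i.e. the scheduler or
        // its thread was blocked.
        Err(_) => println!("outer timeout: executor or thread was blocked"),
    }
}
```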
Here are my logs; they’re probably not that useful by themselves, so I’ll go find context later:
The next step is to find the source of the scheduler or thread hangs. It might be enough to look at the log context; I’ll try that when I get time.
We could also use tokio-console, dtrace, or a similar profiling tool. Ideally we’d activate or inspect it slightly before the hang, then see what Zebra is doing as the hang starts and ends.
After testing PR #6969 on a full sync and a synced instance for a day, I saw no inbound service timeouts, and one inbound handshake timeout:
This is acceptable and doesn’t need any more changes; it looks like the bug is elsewhere.
It was stuck just behind that block at a checkpoint:
The logs are all available on the Google Cloud instance.
I would look at the checkpoint verifier or its downloader. We might need to add more info to the “stuck” log, or add a log for stuck checkpoints.
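As a sketch of what a more informative “stuck” log could carry, using the tracing crate (the function and field names here are placeholders, not Zebra’s actual checkpoint verifier types):

```rust
use std::time::Duration;

/// Placeholder fields standing in for the checkpoint verifier's real state.
fn log_stuck_checkpoint(
    target_checkpoint_height: u32,
    highest_verified_height: u32,
    pending_blocks: usize,
    stuck_for: Duration,
) {
    // Structured fields make it possible to see *which* checkpoint is stuck
    // and for how long, rather than just that the sync has stalled.
    tracing::warn!(
        target_checkpoint_height,
        highest_verified_height,
        pending_blocks,
        ?stuck_for,
        "checkpoint verification appears stuck"
    );
}

fn main() {
    // Requires the `tracing` and `tracing-subscriber` crates; the values below
    // are illustrative only.
    tracing_subscriber::fmt::init();
    log_stuck_checkpoint(1_000_000, 999_600, 400, Duration::from_secs(300));
}
```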
The repeated timeouts are tracked by bug #5709.
I may have copied the wrong log line, but you’re right that they’re a thousand blocks apart.
The block write task returns a different error when it drops a block. So it’s either a download or verification timeout. (Or maybe a verifier hang due to unsatisfied verification constraints.)
Both block hashes are valid, so neither of those errors should have happened:
- https://blockexplorer.one/zcash/testnet/blockHash/000d1acb9eb88a30b1b79d97dc37562f31242ca67b9639549b2ee011fd53515d
- https://blockexplorer.one/zcash/testnet/blockHash/000396228eeb43850b30b2441ca9391fb78dc622d6dd5649db1913efbe7dada0
The Ziggurat team have done some more testing:
This suggests that either the inbound service is blocked, or the entire async thread or executor is blocked.
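One hedged way to tell those two cases apart (a sketch of a generic technique, not something the issue says Zebra does): run a lightweight heartbeat task on the same runtime and log whenever a tick arrives much later than scheduled. If the heartbeat also stalls during a hang, the whole executor or its thread is blocked; if it stays on time while handshakes stall, the blockage is inside the inbound service itself.

```rust
use std::time::{Duration, Instant};

/// Logs whenever the executor fails to poll this task close to on time.
async fn executor_heartbeat(period: Duration, max_lag: Duration) {
    let mut interval = tokio::time::interval(period);
    loop {
        let before = Instant::now();
        interval.tick().await;
        let lag = before.elapsed().saturating_sub(period);
        if lag > max_lag {
            eprintln!("executor heartbeat lagged by {lag:?}");
        }
    }
}

#[tokio::main]
async fn main() {
    tokio::spawn(executor_heartbeat(
        Duration::from_millis(100),
        Duration::from_millis(50),
    ));

    // ... run the inbound connection handling here; heartbeat lag messages
    // during a handshake hang point at the executor, while their absence
    // points at the inbound service ...
    tokio::time::sleep(Duration::from_secs(5)).await;
}
```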