rust-libp2p: Random upgrade timeouts between peers

Hello everyone,

I’m having a weird issue in which, sometimes, a substream will time out during its upgrade. I have little knowledge of Rust futures and of this project’s internals, so I’m sorry if I misinterpret some things. You can find a commit with my debug logs here.

Context

I’m running 3 services on my local machine (let’s call them nodes A / B / C), all of them using rust-libp2p. The swarm uses a custom NetworkBehaviour composed of 3 behaviours:

  • Kademlia
  • Identify
  • A custom behaviour I made

The transport is plaintext TCP with the mplex stream multiplexer. I have some other issues with yamux, which is why I use mplex.

My mplex configuration:

let mut mplex = MplexConfig::default();
mplex.max_buffer_len_behaviour(MaxBufferBehaviour::Block);
mplex.split_send_size(1024 * 1024);
mplex.max_substreams(100000); // Yes, this is insane; it's to work around another issue I had and I haven't taken the time to lower it yet
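
For completeness, here’s roughly how this config plugs into my plaintext TCP transport. This is a simplified sketch rather than my exact code: the keypair handling is made up for the example and the builder calls assume the 0.20-era transport upgrade API.

use libp2p::core::upgrade;
use libp2p::identity;
use libp2p::mplex::{MaxBufferBehaviour, MplexConfig};
use libp2p::plaintext::PlainText2Config;
use libp2p::tcp::TcpConfig;
use libp2p::Transport;

fn main() {
    // Identity handling is simplified for this sketch.
    let keypair = identity::Keypair::generate_ed25519();

    let mut mplex = MplexConfig::default();
    mplex.max_buffer_len_behaviour(MaxBufferBehaviour::Block);
    mplex.split_send_size(1024 * 1024);
    mplex.max_substreams(100_000);

    // Plaintext TCP with mplex stream multiplexing, mirroring the setup above.
    let _transport = TcpConfig::new()
        .upgrade(upgrade::Version::V1)
        .authenticate(PlainText2Config {
            local_public_key: keypair.public(),
        })
        .multiplex(mplex);
}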

I’m using rust-libp2p 0.20.1 but had the same issue on 0.19.1. The issue may be timing-related: it tends not to occur in debug mode with enough logging to “slow down” the services, but release mode almost always triggers it. I also started seeing it after a refactor of my code which seems to have brought a performance boost as well.

The 3 nodes connect to each other on my local machine using 127.0.0.1:port. Node B then advertises 1-2k provider records into the DHT. Node C then receives requests from another external service and does provider lookups in the DHT. The timeout randomly appears during those lookups. From time to time I get something like Request to PeerId("12D3KooWAZ1gXB3ZokKzEnkXvWDK6RWj38rNkkeVTsqyevL5AYqV") in query QueryId(67) failed with Upgrade(Timeout). The timeout is the default one, 10 seconds.
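
In code terms, what nodes B and C do boils down to something like the following. This is a sketch, not my actual code; the key and the helper function are made up for illustration, and the exact return types of the Kademlia calls may differ between versions.

use libp2p::kad::record::store::MemoryStore;
use libp2p::kad::{record::Key, Kademlia};
use libp2p::PeerId;

// Hypothetical helper just to illustrate the two DHT operations involved.
fn dht_calls(local_peer_id: PeerId) {
    let store = MemoryStore::new(local_peer_id.clone());
    let mut kademlia = Kademlia::new(local_peer_id, store);

    let key = Key::new(&b"some-provider-key".to_vec());

    // Node B: advertise a provider record (done ~1-2k times in my case).
    let _ = kademlia.start_providing(key.clone());

    // Node C: look up providers for a key; the result (or the Upgrade(Timeout)
    // failure) arrives later as a KademliaEvent polled through the swarm.
    let _ = kademlia.get_providers(key);
}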

Upgrade timeout

I enabled RUST_LOG=trace to get all the logs I could and started tracing the issue. It seems that node C successfully creates the substream and starts sending data to node B. On node B, I can see the created substream:

Substream 68 logs
2020-06-25T14:46:00.419Z TRACE libp2p_mplex                                  > Received message: Open { substream_id: 68 }
2020-06-25T14:46:00.419Z TRACE libp2p_mplex                                  > Filter pass for Open { substream_id: 68 }
2020-06-25T14:46:00.419Z DEBUG libp2p_mplex                                  > Successfully opened inbound substream 68
2020-06-25T14:46:00.419Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 1 negotiating_in
2020-06-25T14:46:00.419Z TRACE libp2p_swarm::protocols_handler::node_handler > Poll timeout for 0
2020-06-25T14:46:00.419Z TRACE libp2p_swarm::protocols_handler::node_handler > Poll stream for 0
2020-06-25T14:46:00.419Z TRACE libp2p_swarm::protocols_handler::node_handler > Got Poll::Pending
2020-06-25T14:46:00.419Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 1 negotiating_in
2020-06-25T14:46:00.419Z TRACE libp2p_swarm::protocols_handler::node_handler > Poll timeout for 0
2020-06-25T14:46:00.419Z TRACE libp2p_swarm::protocols_handler::node_handler > Poll stream for 0
2020-06-25T14:46:00.419Z TRACE libp2p_swarm::protocols_handler::node_handler > Got Poll::Pending
2020-06-25T14:46:00.419Z TRACE libp2p_mplex                                  > Received message: Data { substream_id: 68, endpoint: Dialer, data: b"\x13/multistream/1.0.0\n\x10/ipfs/kad/1.0.0\n" }
2020-06-25T14:46:00.419Z TRACE libp2p_mplex                                  > Filtering. elem=Data { substream_id: 68, endpoint: Dialer, data: b"\x13/multistream/1.0.0\n\x10/ipfs/kad/1.0.0\n" }, substream_num=2, endpoint=Listener
2020-06-25T14:46:09.000Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
2020-06-25T14:46:10.419Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 1 negotiating_in
2020-06-25T14:46:10.419Z TRACE libp2p_swarm::protocols_handler::node_handler > Poll timeout for 0    

I added a few logs in node_handler and mplex; the whole commit is my debugging attempt.

From what I see here, the newly created substream never leaves the negotiating_in “state” and I’m not sure why. For comparison, here’s another substream, successfully created right before substream 68:

Substream 67 logs
2020-06-25T14:46:00.373Z TRACE libp2p_mplex                                  > Received message: Open { substream_id: 67 }
2020-06-25T14:46:00.373Z TRACE libp2p_mplex                                  > Filter pass for Open { substream_id: 67 }
2020-06-25T14:46:00.373Z DEBUG libp2p_mplex                                  > Successfully opened inbound substream 67
2020-06-25T14:46:00.373Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 1 negotiating_in
2020-06-25T14:46:00.373Z TRACE libp2p_swarm::protocols_handler::node_handler > Poll timeout for 0
2020-06-25T14:46:00.373Z TRACE libp2p_swarm::protocols_handler::node_handler > Poll stream for 0
2020-06-25T14:46:00.373Z TRACE libp2p_mplex                                  > Received message: Data { substream_id: 67, endpoint: Dialer, data: b"\x13/multistream/1.0.0\n\x10/ipfs/kad/1.0.0\n" }
2020-06-25T14:46:00.373Z TRACE libp2p_mplex                                  > Filtering. elem=Data { substream_id: 67, endpoint: Dialer, data: b"\x13/multistream/1.0.0\n\x10/ipfs/kad/1.0.0\n" }, substream_num=67, endpoint=Listener
2020-06-25T14:46:00.373Z TRACE libp2p_mplex                                  > Filter pass for Data { substream_id: 67, endpoint: Dialer, data: b"\x13/multistream/1.0.0\n\x10/ipfs/kad/1.0.0\n" }
2020-06-25T14:46:00.373Z TRACE libp2p_mplex                                  > Poll::Ready for substream 67
2020-06-25T14:46:00.373Z TRACE libp2p_mplex                                  > Poll::Ready for substream 67
2020-06-25T14:46:00.373Z TRACE multistream_select::protocol                  > Received message: Header(V1)
2020-06-25T14:46:00.373Z TRACE libp2p_mplex                                  > Poll::Ready for substream 67
2020-06-25T14:46:00.373Z TRACE libp2p_mplex                                  > Poll::Ready for substream 67
2020-06-25T14:46:00.374Z TRACE multistream_select::protocol                  > Received message: Protocol(Protocol(b"/ipfs/kad/1.0.0"))
2020-06-25T14:46:00.374Z DEBUG multistream_select::listener_select           > Listener: confirming protocol: /ipfs/kad/1.0.0
2020-06-25T14:46:00.374Z DEBUG multistream_select::listener_select           > Listener: sent confirmed protocol: /ipfs/kad/1.0.0
2020-06-25T14:46:00.374Z DEBUG libp2p_core::upgrade::apply                   > Successfully applied negotiated protocol
2020-06-25T14:46:00.374Z TRACE libp2p_swarm::protocols_handler::node_handler > Got Poll::Ready(Ok(upgrade))
2020-06-25T14:46:00.374Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
2020-06-25T14:46:00.374Z TRACE libp2p_mplex                                  > Received message: Data { substream_id: 67, endpoint: Dialer, data: b"E\x08\x03\x12?/org/scala-lang/scala-reflect/2.12.10/scala-reflect-2.12.10.jarP\n" }
2020-06-25T14:46:00.374Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
2020-06-25T14:46:00.374Z TRACE libp2p_mplex                                  > Filtering. elem=Data { substream_id: 67, endpoint: Dialer, data: b"E\x08\x03\x12?/org/scala-lang/scala-reflect/2.12.10/scala-reflect-2.12.10.jarP\n" }, substream_num=2, endpoint=Listener
2020-06-25T14:46:00.374Z TRACE libp2p_mplex                                  > Filtering. elem=Data { substream_id: 67, endpoint: Dialer, data: b"E\x08\x03\x12?/org/scala-lang/scala-reflect/2.12.10/scala-reflect-2.12.10.jarP\n" }, substream_num=67, endpoint=Listener
2020-06-25T14:46:00.375Z TRACE libp2p_mplex                                  > Poll::Ready for substream 67
2020-06-25T14:46:00.375Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
2020-06-25T14:46:00.375Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
2020-06-25T14:46:00.376Z TRACE libp2p_mplex                                  > Received message: Close { substream_id: 67, endpoint: Dialer }

As you can see, the difference between substreams 67 and 68 is that the “filtering” function (the lines starting with Filtering.) successfully returns when called on substream 67, but for substream 68 it tries to match the data against a different substream (substream_num=2 in the logs above) and never tries again later. I’ve seen substreams where the filter function “fails” and then “succeeds” in matching the substream at some point.

I can’t tell for sure whether that’s the problem, but that’s where I am right now and I don’t have many more ideas. I’m not even sure my issue originates here, which is why I opened this issue.

Unfortunately I can’t share a reproducer as I’m not sure how to reproduce it exactly. As I said at the start, I first add ~1-2k provider records to the DHT, which creates that many substreams as well, and that works fine.

Would you happen to have any ideas as to where I should look next to continue debugging this?

Thanks!

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 21 (21 by maintainers)

Most upvoted comments

So the good news is that I think I found the reasons for the sporadic upgrade timeouts (see https://github.com/libp2p/rust-libp2p/pull/1781). Regarding the other two new issues you describe, I’m still looking into those, but I’ve been confused so far by the fact that there were periods where I could trigger them rather easily, though only with async-std, not with tokio, and periods where I cannot reproduce them at all. While I could reproduce them, it looked like a Pending poll_read at the TCP transport level left the still “pending” connection background task Pending as well, waiting for a multistream-select message without ever being woken up again, despite the message being visibly delivered at the TCP level via tcpdump / wireshark. I will continue to look into that. Thanks again for your test setup.

I think all issues here have been resolved. Feel free to re-open if you still encounter the problems discussed here so far or open new issues. Thanks again for your detailed reports and sample code.

Another weird thing is that, in my reproducer, with both yamux and mplex on current master, the swarm sometimes doesn’t work properly. When I start my two instances, both peers seem to connect to each other fine, but some things are off: […]

I’ve probably stumbled upon the cause for these new early connection negotiation stalls, which seem related to the plaintext protocol. See https://github.com/libp2p/rust-libp2p/pull/1831. I hope that this was the last problem to resolve here 🙏.

Thanks for the feedback, I will take another look at your setup.

Since https://github.com/libp2p/rust-libp2p/pull/1769 addressed some problems with mplex, it would be great if you could check whether the issues you ran into with mplex still occur in your test setup when using rust-libp2p/master. As I personally wasn’t very successful at reproducing these issues earlier, I don’t put much weight on the fact that it “works for me”, so your feedback would be very valuable.

Hello everyone,

I tested my reproducer with yamux instead of mplex. It doesn’t seem to trigger the bug in the ~10 runs I did. I quickly tried it on my internal project, but unfortunately I’m hitting another issue, probably due to how I send messages between peers. I’m not sure whether it’s due to yamux or my debug code, but I won’t have much time to dig into it over the next few days. Sorry about that.

If I remember correctly, it was about the fact that I wanted to send big chunks over the network to another peer and yamux would not handle it correctly, leading to data not being sent.

I do not want to hijack this issue thread so it would be best to discuss this elsewhere but I would be keen to know what exactly was not handled correctly.

Hello,

So I’ve come up with a reproducer. Since I knew more about the issue, I knew what to do to trigger it again. You can find the repository here: https://github.com/BlackYoup/libp2p-handler-timeout-reproducer

It’s still a handler upgrade timeout, but it happens during multiple provider PUTs into the DHT rather than during GET PROVIDERS requests. The PUTs follow the same pattern, one at a time.

I unexpectedly ran into it while iterating on the reproducer and, conversely, couldn’t reproduce my issue when using GET PROVIDERS requests. Anyway, hopefully it’s the same thing happening.

The README contains most of what’s needed to reproduce it and describes how I start it on my machine, so ask me any questions about it if needed. I’m using Rust 1.44.1.

Once the issue here is resolved I would love to hear about the problems you had with yamux. Maybe you could open an issue at https://github.com/paritytech/yamux or write me an e-mail?

Sure, it’s on my todo list to report this issue. I hit it 2-3 months ago when my knowledge was more limited, so I would have to retry. If I remember correctly, it was about the fact that I wanted to send big chunks over the network to another peer and yamux would not handle it correctly, leading to data not being sent. When I switched to mplex, I used

let mut mplex = MplexConfig::default();
mplex.max_buffer_len_behaviour(MaxBufferBehaviour::Block);

This solved my problem. I believe it’s the same issue as explained in https://github.com/libp2p/rust-libp2p/issues/1257#issuecomment-539524128, and it’s what prompted me to move to mplex.