go-libp2p-kad-dht: Occasional blocking when using Provide()

I have a worker node specialized in DHT Provide-ing, based on the go-ipfs core with the latest DHT release (go-libp2p-kad-dht v0.5.0, go-ipfs 8e9725fd0009, master from Feb 7). What this worker does is run highly concurrent Provide calls (up to 5k concurrent tasks) to publish a large number of values in a reasonable amount of time.

While this initially works quite well, I found out that some Provide calls get stuck over time, even though I pass them a context with a 5-minute timeout. To detect the problem, I set up an extra goroutine for each task that checks at 5 minutes 30 seconds whether the Provide has returned.
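
Roughly, each task looks like this (a simplified sketch; provideWithWatchdog and the logging are illustrative, not the actual worker code):

```go
import (
	"context"
	"log"
	"time"

	"github.com/ipfs/go-cid"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

// provideWithWatchdog is a simplified version of what each task does: a
// 5-minute context for the Provide itself, plus a watchdog goroutine that
// reports the task as stuck if Provide has not returned 30 seconds past
// that deadline.
func provideWithWatchdog(ctx context.Context, d *dht.IpfsDHT, c cid.Cid) error {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
	defer cancel()

	done := make(chan struct{})
	go func() {
		select {
		case <-done:
			// Provide returned in time, nothing to report.
		case <-time.After(5*time.Minute + 30*time.Second):
			log.Printf("Provide(%s) still not returned 30s past its deadline", c)
		}
	}()

	err := d.Provide(ctx, c, true)
	close(done)
	return err
}
```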

Here is an example run:

image

As you can see, there are indeed stuck Provide calls. As those never return, they eventually bring down the publish rate, since they never pick up new values to publish. By the way, don’t be concerned by the progressive ramp-up at the beginning; it’s just the worker spawning new tasks gradually to avoid congestion.

To track this problem down further, I wrote some code that, once a first blocking is detected, drops the concurrency factor to zero, as sketched below. The result is that all the sane Provide calls complete and get removed, leaving only the blocked ones. When this happens, I find that hundreds of them are left, even though the teardown starts as soon as the first one is found.
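
The gist of that teardown logic, heavily simplified (hypothetical names, not the real worker code):

```go
import "sync/atomic"

// draining is flipped to 1 as soon as the watchdog reports the first stuck
// Provide; from then on the concurrency factor is effectively zero.
var draining int32

func onStuckDetected() {
	atomic.StoreInt32(&draining, 1)
}

// maybeSpawn only launches a new Provide task while we are not draining,
// so the sane tasks complete and disappear, leaving only the blocked ones.
func maybeSpawn(task func()) bool {
	if atomic.LoadInt32(&draining) == 1 {
		return false
	}
	go task()
	return true
}
```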

After all the sane Provide calls returned, I took a goroutine dump:

raw goroutine dump deduped with goroutine-inspect (notice the duplicate count for each stacktrace)
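
(For reference, a dump like this can be produced with the standard runtime/pprof facility; a minimal sketch, assuming direct access to the process rather than a net/http/pprof endpoint:)

```go
import (
	"log"
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes a full goroutine dump to a file. debug=2 prints the
// complete stack trace of every goroutine, which is the format that
// goroutine-inspect can then deduplicate.
func dumpGoroutines(path string) {
	f, err := os.Create(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.Lookup("goroutine").WriteTo(f, 2); err != nil {
		log.Fatal(err)
	}
}
```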

Here is what I could infer from that:

  • 13k goroutines left
  • still a LOT of stuck tasks according to the logs
  • 2827 in bitswap
  • 1730 in DHT’s handleNewMessage --> go-msgio --> read IO
  • 6 in DHT’s dial queue
  • 2831 in yamux
  • 118 in bitswap’s blockstoremanager
  • 2812 in another place in yamux
  • 2830 in swarm c.conn.AcceptStream()

A few points:

  • it happens more with high concurrency, but I can’t tell whether it’s because of the concurrency itself or simply because there are more occasions to trigger the failure
  • once a first blocking happens, it’s likely that more will follow quickly; again, this can be a cascade of events or simply mean that the conditions to trigger it are met
  • despite all of that, Provide never returns an error
  • this could simply be a problem in the code I use to detect the blocking, but that code looks correct, and the detected blockings correlate really well with the decrease in publish rate
  • that said, I couldn’t find the place where Provide actually blocks; I suspect congestion or a deadlock of some sort but can’t pinpoint where

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 29 (29 by maintainers)

Most upvoted comments

A few days ago I noticed that I had a bug in my auto-scaling code which resulted in occasionally failing to count terminating tasks properly, so I entirely rewrote that part into a much cleaner and, more importantly, well-tested package to remove that variable from the equation. While doing that I also added better telemetry and better stuck-task detection. Here is what I found out.

  • Provide() calls do get stuck, but not in the sense that I thought. They significantly outlive their context, but they do eventually return. As you can see in the following, there are no stuck tasks piling up over time, but a significant number outlive their context: image

Here is a sample distribution of how long it takes them to return after the context is done: image

Not surprisingly, using a shorter timeout (40s) significantly increases the number of stuck tasks. It looks like they just finish naturally, without any regard for the context deadline. @Stebalien I see your fix/observe-context-in-message-sender branch, I’ll give it a try.
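
The overrun shown in those graphs can be measured with something along these lines (a hypothetical helper, same imports as the earlier sketch; the actual telemetry code is not shown here):

```go
// provideOverrun reports by how much a Provide call outlived its context.
func provideOverrun(ctx context.Context, d *dht.IpfsDHT, c cid.Cid, timeout time.Duration) (time.Duration, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	err := d.Provide(ctx, c, true)

	if deadline, ok := ctx.Deadline(); ok && time.Now().After(deadline) {
		// Provide returned after its deadline: report the overshoot.
		return time.Since(deadline), err
	}
	return 0, err
}
```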

  • Despite different scaling parameters and timeouts, all the workers follow a fairly similar pattern in terms of publishing speed: image

This pattern is the result of the auto-scaling doing its thing: image

This scaling down is there to keep CPU usage relatively constant: image

Eventually, this means that Provide() globally and rather suddenly requires more CPU to do the same work (again, the correlation in the timing is surprising): image

During all that, the time it takes to publish a single value stays quite constant: image

Just for the sake of completeness, memory usage stays constant once it reaches a plateau: image

  • Another thing I found out: the OpenCensus metrics in the DHT are misleading, at least to me. ReceivedMessages, SentMessages and SentRequests only record valid messages or requests, because of the early return on error. So, for example, graphing an error/total ratio will be completely bogus (notice how it goes over 100%). image
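
To illustrate the pattern I mean (this is not the actual DHT code, just the general shape, with made-up measure names): when the error path returns early, only the error measure is recorded, so the "sent" measures end up counting successes rather than attempts, and a ratio built from them can exceed 100%.

```go
import (
	"context"
	"io"

	"go.opencensus.io/stats"
)

var (
	sentRequests = stats.Int64("sent_requests", "requests sent", stats.UnitDimensionless)
	sentErrors   = stats.Int64("sent_request_errors", "requests that failed to send", stats.UnitDimensionless)
)

func sendRequest(ctx context.Context, w io.Writer, payload []byte) error {
	if _, err := w.Write(payload); err != nil {
		stats.Record(ctx, sentErrors.M(1))
		return err // early return: sentRequests is never incremented
	}
	stats.Record(ctx, sentRequests.M(1))
	return nil
}
```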

To finish, here is a deduped goroutine dump of a worker under load, past the point of increased CPU usage: goroutines-on-load-dedup.txt

… and two pprof CPU profiles: one from a recently launched worker, one from a long-running one: profile-short-running.txt profile-long-running.txt -> go tool pprof --http=:8123 -nodecount=10000 /tmp/profile-long-running.txt