go-libp2p-kad-dht: Occasional blocking when using Provide()

I have a worker node specialized in DHT Provide-ing, based on the go-ipfs core with the latest DHT release (go-libp2p-kad-dht v0.5.0, go-ipfs 8e9725fd0009, master from Feb 7). What this worker does is run highly concurrent Provide calls (up to 5k concurrent tasks) to publish a large number of values in a reasonable amount of time.

While this initially works quite well, I found out that some Provide calls get stuck over time, even though I pass them a context with a 5-minute timeout. To detect the problem, I set up an extra goroutine for each task that checks at 5 minutes 30 seconds whether the Provide has returned.
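
Roughly, each task looks like this (a simplified sketch; provideWithWatchdog and the logging are illustrative, not the actual worker code):

```go
import (
	"context"
	"log"
	"time"

	"github.com/ipfs/go-cid"
	dht "github.com/libp2p/go-libp2p-kad-dht"
)

// provideWithWatchdog is a simplified version of what each task does: a
// 5-minute context for the Provide itself, plus a watchdog goroutine that
// reports the task as stuck if Provide has not returned 30 seconds past
// that deadline.
func provideWithWatchdog(ctx context.Context, d *dht.IpfsDHT, c cid.Cid) error {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
	defer cancel()

	done := make(chan struct{})
	go func() {
		select {
		case <-done:
			// Provide returned in time, nothing to report.
		case <-time.After(5*time.Minute + 30*time.Second):
			log.Printf("Provide(%s) still not returned 30s past its deadline", c)
		}
	}()

	err := d.Provide(ctx, c, true)
	close(done)
	return err
}
```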

Here is an example run:

image

As you can see, there are indeed stuck Provide calls. As those never return, they eventually bring down the publish rate, since they never pick up new values to publish. By the way, don’t be concerned by the progressive ramp-up at the beginning; it’s just the worker spawning new tasks gradually to avoid congestion.

To track this problem down further, I wrote some code that, once a first blocking is detected, drops the concurrency factor to zero, as sketched below. The result is that all the sane Provide calls complete and get removed, leaving only the blocked ones. When this happens, I find that hundreds of them are left, even though the teardown starts as soon as the first one is found.
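
The gist of that teardown logic, heavily simplified (hypothetical names, not the real worker code):

```go
import "sync/atomic"

// draining is flipped to 1 as soon as the watchdog reports the first stuck
// Provide; from then on the concurrency factor is effectively zero.
var draining int32

func onStuckDetected() {
	atomic.StoreInt32(&draining, 1)
}

// maybeSpawn only launches a new Provide task while we are not draining,
// so the sane tasks complete and disappear, leaving only the blocked ones.
func maybeSpawn(task func()) bool {
	if atomic.LoadInt32(&draining) == 1 {
		return false
	}
	go task()
	return true
}
```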

After all the sane Provide calls returned, I took a goroutine dump:

raw goroutine dump deduped with goroutine-inspect (notice the duplicate count for each stacktrace)
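
(For reference, a dump like this can be produced with the standard runtime/pprof facility; a minimal sketch, assuming direct access to the process rather than a net/http/pprof endpoint:)

```go
import (
	"log"
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes a full goroutine dump to a file. debug=2 prints the
// complete stack trace of every goroutine, which is the format that
// goroutine-inspect can then deduplicate.
func dumpGoroutines(path string) {
	f, err := os.Create(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.Lookup("goroutine").WriteTo(f, 2); err != nil {
		log.Fatal(err)
	}
}
```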

Here is what I could infer from that:

  • 13k goroutines left
  • still a LOT of stuck tasks according to the logs
  • 2827 in bitswap
  • 1730 in DHT’s handleNewMessage --> go-msgio --> read IO
  • 6 in DHT’s dial queue
  • 2831 in yamux
  • 118 in bitswap’s blockstoremanager
  • 2812 in another place in yamux
  • 2830 in swarm c.conn.AcceptStream()

A few points:

  • it happens more with high concurrency, but I can’t tell whether it’s because of the concurrency itself or simply because there are more occasions to trigger the failure
  • once a first blocking happens, it’s likely that more will follow quickly; again, this can be a cascade of events or simply mean that the conditions to trigger it are met
  • despite all of that, Provide never returns an error
  • this could simply be a problem in the code I use to detect the blocking, but that code looks correct, and the detected blockings correlate really well with the decrease in publish rate
  • that said, I couldn’t find the place where Provide actually blocks; I suspect congestion or a deadlock of some sort but can’t pinpoint where

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 29 (29 by maintainers)

Most upvoted comments

A few days ago I noticed that I had a bug in my auto-scaling code which resulted in occasionally failing to count terminating tasks properly, so I entirely rewrote that part into a much cleaner and, more importantly, well-tested package to remove that variable from the equation. While doing that I also added better telemetry and better stuck-task detection. Here is what I found out.

  • Provide() calls do get stuck, but not in the sense that I thought. They significantly outlive their context, but they do eventually return. As you can see in the following, there are no stuck tasks piling up over time, but a significant number outlive their context: image

Here is a sample distribution of how long it takes them to return after the context is done: image

Not surprisingly, using a shorter timeout (40s) significantly increases the number of stuck tasks. It looks like they just finish naturally, without any regard for the context deadline. @Stebalien I see your fix/observe-context-in-message-sender branch, I’ll give it a try.
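
The overrun shown in those graphs can be measured with something along these lines (a hypothetical helper, same imports as the earlier sketch; the actual telemetry code is not shown here):

```go
// provideOverrun reports by how much a Provide call outlived its context.
func provideOverrun(ctx context.Context, d *dht.IpfsDHT, c cid.Cid, timeout time.Duration) (time.Duration, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	err := d.Provide(ctx, c, true)

	if deadline, ok := ctx.Deadline(); ok && time.Now().After(deadline) {
		// Provide returned after its deadline: report the overshoot.
		return time.Since(deadline), err
	}
	return 0, err
}
```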

  • Despite different scaling parameters and timeouts, all the workers follow a fairly similar pattern in terms of publishing speed: image

This pattern is the result of the auto-scaling doing its thing: image

This scaling down is there to keep CPU usage relatively constant: image

Eventually, this means that Provide() globally and rather suddenly requires more CPU to do the same work (again, the correlation in the timing is surprising): image

During all that, the time it takes to publish a single value stays quite constant: image

Just for the sake of completeness, memory usage stays constant once it reaches a plateau: image

  • Another thing I found out: the OpenCensus metrics in the DHT are misleading, at least to me. ReceivedMessages, SentMessages and SentRequests only record valid messages or requests, because of the early return on error. So, for example, graphing an error/total ratio will be completely bogus (notice how it goes over 100%). image
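
To illustrate the pattern I mean (this is not the actual DHT code, just the general shape, with made-up measure names): when the error path returns early, only the error measure is recorded, so the "sent" measures end up counting successes rather than attempts, and a ratio built from them can exceed 100%.

```go
import (
	"context"
	"io"

	"go.opencensus.io/stats"
)

var (
	sentRequests = stats.Int64("sent_requests", "requests sent", stats.UnitDimensionless)
	sentErrors   = stats.Int64("sent_request_errors", "requests that failed to send", stats.UnitDimensionless)
)

func sendRequest(ctx context.Context, w io.Writer, payload []byte) error {
	if _, err := w.Write(payload); err != nil {
		stats.Record(ctx, sentErrors.M(1))
		return err // early return: sentRequests is never incremented
	}
	stats.Record(ctx, sentRequests.M(1))
	return nil
}
```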

To finish, here is a deduped goroutine dump of a worker under load, past the point of increased CPU usage: goroutines-on-load-dedup.txt

… and two pprof CPU profiles: one from a recently launched worker, one from a long-running one: profile-short-running.txt profile-long-running.txt -> go tool pprof --http=:8123 -nodecount=10000 /tmp/profile-long-running.txt