google-cloud-ruby: Pub/Sub subscriber stops processing messages

I’m pretty sure I’m having the same problems described in https://github.com/googleapis/google-cloud-ruby/issues/2011. Didn’t want to reopen that issue, apologies if it wasn’t the right decision.

I run a very simple Ruby consumer (a rough sketch follows the list) where I:

  • Set up a Subscriber and call start on it.
  • Boot another thread that runs a basic Rack HTTP server to expose some custom Prometheus metrics.
  • Start an endless loop that prints some status output every few seconds.
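For reference, here is a minimal sketch of that setup. It assumes the google-cloud-pubsub listen/start API and Rack 2.x with WEBrick; the project ID, subscription name, and port are placeholders:

require "google/cloud/pubsub"
require "rack"

pubsub       = Google::Cloud::PubSub.new project_id: "my-project" # placeholder
subscription = pubsub.subscription "my-subscription"              # placeholder

# 1. Set up a Subscriber and start it.
subscriber = subscription.listen do |message|
  # ... handle the message ...
  message.acknowledge!
end
subscriber.start

# 2. Boot a metrics HTTP server in its own thread (a stand-in for the real
#    prometheus_exporter setup).
Thread.new do
  Rack::Handler::WEBrick.run(
    ->(_env) { [200, { "Content-Type" => "text/plain" }, ["metrics go here"]] },
    Port: 9394
  )
end

# 3. Endless loop that prints status every few seconds.
loop do
  puts "still alive at #{Time.now}"
  sleep 10
end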

I observe that sometimes a consumer just stops processing messages. For instance, in the graph below you can see a couple of pods where CPU usage just drops off for no apparent reason while others keep going:

[Screenshot: per-pod CPU usage, with a couple of pods dropping off while the others keep going]

The HTTP thread keeps working, and when the pod is stopped (because of autoscaling or whatever) it shuts down cleanly.

I added an autoscaler based on CPU a week ago, and that seemed to help, but after a week I see the same behavior. Before the autoscaler they would almost always get stuck unless I ran more consumers than I needed.

My questions, I guess:

  • Any idea why this is happening? Could something internal to the pubsub gem be crashing in a way that doesn’t raise anything? I’ve worked with Ruby for a long time, and threads still make me shudder.
  • Any way we could cleanly detect if this happens? Would calling Subscriber#stopped? let us know if things have stopped processing?

I would rather not rely on “have messages been processed in the last N seconds” to know whether a consumer is healthy, because we run all kinds of consumers in all kinds of environments (like staging) where I can’t count on a steady stream of incoming messages.
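For what it’s worth, the kind of check I had in mind is something like the sketch below (subscriber is the Subscriber from the sketch above). It assumes Subscriber#stopped? is the right signal, which is exactly what I’m unsure about, since it may only report an explicit stop rather than a silently dead background thread:

# Hypothetical liveness probe: periodically ask the subscriber for its state.
Thread.new do
  loop do
    warn "subscriber reports stopped!" if subscriber.stopped?
    sleep 30
  end
end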

Thanks a lot!

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 63 (46 by maintainers)

Most upvoted comments

LOL, of course. Fixed, please pull latest from that branch.

I have updated the thread-monitoring branch to work off of the code in the latest release. It adds a monitor that watches the subscriber’s background thread and restarts it. It also adds some logging to help indicate more about what is going on. To use it, add the following to your Gemfile:

gem "google-cloud-pubsub", github: "blowmage/google-cloud-ruby", branch: "pubsub/subscriber-threading-monitor"

To enable the logging, add the following somewhere in your code:

require "logger"
require "google/cloud/pubsub"
Google::Cloud.logger = Logger.new STDOUT

When I updated this branch I moved where the logger is set to Google::Cloud.logger. We don’t currently have any logging in the google-cloud libraries themselves, as the focus has been on logging in the grpc and google-api-client libraries. But I thought this would be nicer and wanted to try it out. Let me know if this level of logging would be helpful to you.
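For anyone curious, the monitoring the branch adds boils down to this kind of pattern. This is a simplified sketch, not the branch’s actual code, and run_with_monitor is a made-up name:

# Simplified watchdog sketch: poll a background worker thread and restart
# the work if the thread has died.
def run_with_monitor(interval: 30, &work)
  worker = Thread.new(&work)
  Thread.new do
    loop do
      sleep interval
      next if worker.alive?
      warn "background thread died; restarting"
      worker = Thread.new(&work)
    end
  end
  worker
end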

@blowmage Yup, pointed my test repo to master and it seems to churn through every message now regardless of raises. (Sorry my repo didn’t work for you).

I think we can maybe close this now.

Sorry I haven’t provided an update for a while. Time zones and holidays!

We just had one occurrence of this (that I could see). I had added a few more metrics to monitor the consumers using prometheus_exporter. So apart from GC runs and the like, I also added a couple of custom things. First, the number of threads running/sleeping. Then I tried to monitor some things from the subscriber itself, and I ended up adding a couple of gauges fed by subscriber.stream_pool.count and subscriber.stream_pool.map(&:inventory).sum(&:count).
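Roughly, those gauges look like the sketch below. This assumes the prometheus_exporter client API (PrometheusExporter::Client#register); my real wiring differs a bit, and the metric names and interval here are made up:

require "prometheus_exporter/client"

client          = PrometheusExporter::Client.default
stream_gauge    = client.register(:gauge, "pubsub_stream_count", "streams in the subscriber's pool")
inventory_gauge = client.register(:gauge, "pubsub_inventory_count", "messages held in the subscriber's inventory")

Thread.new do
  loop do
    stream_gauge.observe subscriber.stream_pool.count
    inventory_gauge.observe subscriber.stream_pool.map(&:inventory).sum(&:count)
    sleep 15
  end
end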

All metrics look pretty normal until the point where it got stuck, at which time I saw a bunch of log messages like this: Generating Prometheus metrics text timed out. This comes from inside prometheus_exporter, and I think it means that some of my custom code is hanging. At the moment I suspect one (or both) of those custom subscriber metrics is timing out.

I don’t see much correlation between inventory and the consumer hanging either:

[Screenshot: inventory gauge over time around the point where the consumer hangs]

Metrics for inventory stop after it hangs, because of the timeouts I mentioned. I don’t know what any of this means; I’m just bringing it up in case it makes sense to you.

I’m back to work tomorrow, I’ll try that branch then.

I am always calling acknowledge! in an ensure inside the listen block, by the way.
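Roughly like this (process is a stand-in for the real handler):

subscriber = subscription.listen do |message|
  begin
    process message
  ensure
    message.acknowledge!
  end
end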