roadrunner: [BUG] All workers ready, exec 0

Hi! First of all, thanks for RoadRunner! ❤️ It’s a great tool.

We have stability issues on one of our big websites. This has occurred many times in the past and we can’t identify a clear cause. After several hours or days, RoadRunner suddenly stops responding to requests completely. Running rr workers shows that all of the workers are ready with an EXECS count of 0. The issue never resolves on its own; it only goes away once we run rr reset.

My guess is that workers sometimes start incompletely and are unable to process requests, but rr never reloads them because they never reach any of the soft or hard limits. Over time these workers might accumulate until there are no healthy workers left to process any requests. This is just a theory. The workers are still displayed as ready.

The version of RR used:

rr version 2.3.0 (build time: 2021-06-11T14:54:08+0000, go1.16.4)

My .rr.yaml configuration is:

rpc:
  listen:     tcp://127.0.0.1:6000

server:
  command: "/usr/bin/php7.4 psr-worker.php"

http:
  address: 0.0.0.0:8082
  
  pool:
    num_workers: 48
    
    supervisor:
      watch_tick: 60s
      max_worker_memory: 200
      ttl: 84600s
      exec_ttl: 30s

We’ve only just added error logging to our config, so unfortunately I can’t provide any logs yet. I will update this report as soon as we have anything.

Do you have any suspicion about what might be causing this? Is there anything other than logs that we can provide that might be helpful?

Thanks for your help!

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 20 (14 by maintainers)

Most upvoted comments

Without execTTL (or without supervisor at all) it should be working fine if you need a quick fix.
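As a rough sketch of that quick fix, the http section from the report could be run with the exec_ttl line dropped and everything else left alone. All values below are copied from the reporter's config; the only change is the removed exec_ttl, and this has not been verified against the reporter's setup.

```yaml
http:
  address: 0.0.0.0:8082

  pool:
    num_workers: 48

    supervisor:
      watch_tick: 60s
      max_worker_memory: 200
      ttl: 84600s
      # exec_ttl: 30s was removed here as a temporary workaround
      # until the fix announced for 2.3.1 is released
```

The other option mentioned, removing the supervisor block entirely, would also drop the memory and ttl limits, so keeping the block and removing only exec_ttl is probably the smaller change.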

No need to update the beta, we will fix this issue before the next release (2.3.1, next Tuesday). Thank you for bringing this issue to us, we will solve it at the highest priority.

Oh sorry, just saw that message now. We’ll wait for the fix then and report if it resolves the issue. Thanks for the very quick reaction! We really appreciated it!

Got you, thanks. I’ve reproduced this issue and I’ll try to fix it ASAP. It’s priority number 1 at the moment.

@rustatian @wolfy-j Great! Thanks again to both of you. Hope you have a great weekend!

@rustatian Today was a holiday in our city so we were off. We’re probably gonna upgrade tomorrow in the evening. I’ll let you know how it goes 🙂

Have a wonderful weekend too, and welcome to the RR/Spiral community 😃

Does it make sense to mark this error as a warning in the future?

Writing twice into a closed ResponseWriter is an error. This should be fixed.

@iluuu1994 I found the issue, the fix will be on Tuesday (v2.3.1).

Depends. How stable is it? Unfortunately we can only reproduce it in the production environment. I guess it’s load-dependent.

It’s pretty stable. We will release the same version next week, just without the beta suffix. It has a few new configuration options, like broadcast and reworked WebSockets with KV, so if you don’t use these features, you can safely update.