workers-sdk: 🐛 BUG: "[ERROR] Error in ProxyController: Error inside ProxyWorker"

Which Cloudflare product(s) does this pertain to?

Wrangler core, Miniflare

What version(s) of the tool(s) are you using?

3.19.0 [Wrangler]

What version of Node are you using?

20.10.0

What operating system are you using?

Linux

Describe the Bug

Repeatedly calling an endpoint that calls a Durable Object results in the following error on every other request:

✘ [ERROR] Error in ProxyController: Error inside ProxyWorker

  {
    name: 'Error',
    message: 'Network connection lost.',
    stack: 'Error: Network connection lost.'
  }

I'm not sure whether the cause is actually the repeated calls to the DO. In DevTools, the requests to the DO all appear to be successful.

Downgrading to 3.18.0 fixes this issue, so this is possibly a regression involving the startDevWorker refactor.
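
For context, the kind of setup being described is roughly the sketch below: a Worker endpoint that forwards every request to a single, counter-style Durable Object. This is an illustrative reconstruction, not the reporter's actual code, and all names are placeholders.

    // Illustrative sketch only: a Worker that forwards every request to one
    // counter-style Durable Object instance (names are placeholders).
    export class Counter {
      constructor(private state: DurableObjectState) {}

      async fetch(_request: Request): Promise<Response> {
        // Increment a persisted counter and return the new value.
        const value = ((await this.state.storage.get<number>("value")) ?? 0) + 1;
        await this.state.storage.put("value", value);
        return new Response(String(value));
      }
    }

    export default {
      async fetch(request: Request, env: { COUNTER: DurableObjectNamespace }): Promise<Response> {
        // Route every request to the same DO instance.
        const id = env.COUNTER.idFromName("singleton");
        return env.COUNTER.get(id).fetch(request);
      },
    };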

Please provide a link to a minimal reproduction

No response

Please provide any relevant error logs

No response

About this issue

  • Original URL
  • State: closed
  • Created 7 months ago
  • Reactions: 16
  • Comments: 40 (16 by maintainers)

Most upvoted comments

Hi @aroman and all, apologies for the delayed action on this issue. Our team got pulled into high-priority internal work over the last several weeks and we fell behind on our regular workers-sdk maintenance. I appreciate you calling out our engagement with the community as a positive – we strive to keep you all informed as much as possible. This is a good reminder for us to keep communicating any internal discussion we have about particular issues so you all are always up to date on their status.

In terms of concrete next steps, we have prioritized this issue for this week and assigned someone to address it. While we're also taking strides to reduce the number of regressions by increasing test coverage, going forward we'll be prioritizing fixing any regressions that do slip through as quickly as possible – we have also just added a regression label so that items such as these get highlighted; please feel free to use it 😃

Thanks for raising this feedback!

@admah even the Counter example in the DO docs has this issue with any wrangler version above 3.18. It does not require any concurrent requests. I can reliably reproduce the issue with hand-triggered HTTP requests as long as they are less than, say, 3 seconds apart. The behavior is very consistent: always 1 working request followed by 1 broken request, repeating.
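
For anyone trying to reproduce this, a loop along these lines exercises the pattern described above (a hypothetical Node script; the URL is the default wrangler dev address and the 1-second gap is just an example well under the ~3 seconds mentioned):

    // Hypothetical repro loop: hit the local dev server repeatedly with a
    // short gap between requests and log how each one fares.
    const url = "http://localhost:8787"; // default `wrangler dev` address (assumption)

    async function main(): Promise<void> {
      for (let i = 0; i < 10; i++) {
        try {
          const res = await fetch(url);
          console.log(`request ${i}: ${res.status} ${await res.text()}`);
        } catch (err) {
          console.log(`request ${i}: failed (${(err as Error).message})`);
        }
        // Well under the ~3 second gap mentioned above.
        await new Promise((resolve) => setTimeout(resolve, 1_000));
      }
    }

    main();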

@aroman that is a good callout. We are constantly reviewing our processes to see what we can do to mitigate these types of incidents, because we do understand how disruptive they are.

For this issue (and any others related to previous startDevWorker work) we have prioritized them and are working to have them resolved ASAP.

I found a way to reliably reproduce this on Windows @RamIdeas (and some success on MacOS). I filed https://github.com/cloudflare/workers-sdk/issues/5095 to track separately.

Thanks @matthewjosephtaylor. Unfortunately they aren't consistent for one member of my team: they can send 100 requests, 98 are fine, and then 2 of them throw the new 503 introduced in #4867 😔

At the risk of adding the wrong signal here: I see more failures when a CORS (OPTIONS) request is involved. The app has access to Queues and KV, not DO like others are using above.

(Screenshot: 2024-01-10 at 2:00:56 PM)

Confirmed in my development environment when using CORS, but in the deployed environment it works fine.
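
For reference, the kind of preflight handling referred to here is roughly the following (a generic sketch, not the commenters' actual code; header values and routes are placeholders):

    // Generic sketch of a Worker that answers CORS preflight (OPTIONS)
    // requests before handling the real request. Header values are placeholders.
    const corsHeaders: Record<string, string> = {
      "Access-Control-Allow-Origin": "*",
      "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
      "Access-Control-Allow-Headers": "Content-Type",
    };

    export default {
      async fetch(request: Request): Promise<Response> {
        if (request.method === "OPTIONS") {
          // Preflight: reply immediately with the CORS headers.
          return new Response(null, { status: 204, headers: corsHeaders });
        }
        // Normal request path.
        return new Response("ok", { headers: corsHeaders });
      },
    };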

@RamIdeas We actually migrated off Cloudflare Workers not long after this issue was filed, so my memory of it may not be the best.

Could you clarify whether this is indeed an error causing your local development environment to behave incorrectly, or whether it is just a noisy log message in your terminal?

It's not just a noisy log message: every other HTTP request to the endpoint errors out with that error.

I tried to reproduce this issue just now using a stripped-down version of our old code (repeatedly calling a DO), but couldn't. It's possible that the root cause lies in other interactions, but I can't recall exactly what.

Work-around:

I'm using Durable Objects with WebSockets.

As long as I have the client continually send a 'ping' message across the socket every 5 seconds, I no longer experience the issue.

I STRONGLY suspect the issue is with wrangler hibernating the worker. If I never let it rest, I don't get the error.
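
A client-side keepalive of the kind described might look like this (a sketch assuming a browser WebSocket client; the URL is a placeholder and the 5-second interval is the one mentioned above):

    // Keepalive sketch: send a "ping" over the WebSocket every 5 seconds so
    // the connection never sits idle. The URL is a placeholder.
    const ws = new WebSocket("ws://localhost:8787/connect");

    let keepalive: ReturnType<typeof setInterval> | undefined;

    ws.addEventListener("open", () => {
      keepalive = setInterval(() => ws.send("ping"), 5_000);
    });

    ws.addEventListener("close", () => {
      if (keepalive !== undefined) clearInterval(keepalive);
    });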

If you're talking about the WebSocket Hibernation API, we don't use those APIs in our project (we use WebSockets + DOs but not the hibernation APIs), and we are still hit hard by this issue when developing locally and running tests.

Separately, for CF folks (cc: @lrapoport-cf), I'd like to underscore the point @beanow-at-crabnebula made above. At my company, we've been unable to upgrade wrangler since early November due to one regression after another (3 by my count: this one, #4496, and one caused by #4535 whose issue I can't seem to find). Of these, only the last was documented in release notes.

I appreciate how frequently wrangler releases new versions, but my feeling is that something has got to change process-wise to mitigate the frequency and severity of breaking regressions — or at least document them in release notes so folks are aware of the tradeoffs when upgrading. I appreciate how active and interactive with customers your team is on the CF github repos — that level of communication goes a long way.

Just adding a note that there are now critical CVEs against the only workaround (downgrading to wrangler 3.18): https://github.com/cloudflare/workers-sdk/security

Upgrading to the latest wrangler also caused Jest to detect leaking request handles. This makes using unstable_dev for integration tests, as documented, very brittle.

I've started seeing flakes in tests that perform lots of unstable_dev starts, without any DO bindings (1 KV and 1 R2 binding, though).

Reverting to 3.18.0 indeed solves the issue.

Node 18, Linux, if that matters.
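
For anyone hitting the Jest handle leaks mentioned in the previous comment, the documented unstable_dev pattern with an explicit stop() in teardown looks roughly like this (the script path and assertion are placeholders):

    // Sketch of the documented unstable_dev integration-test pattern, with an
    // explicit stop() so the dev server handle isn't left open. The script
    // path and assertion are placeholders.
    import { unstable_dev } from "wrangler";
    import type { UnstableDevWorker } from "wrangler";

    describe("worker", () => {
      let worker: UnstableDevWorker;

      beforeAll(async () => {
        worker = await unstable_dev("src/index.ts", {
          experimental: { disableExperimentalWarning: true },
        });
      });

      afterAll(async () => {
        // Without this, test runners can report leaked request handles.
        await worker.stop();
      });

      it("responds", async () => {
        const res = await worker.fetch("/");
        expect(res.status).toBe(200);
      });
    });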