cypress: Intermittent requests hanging, eventually crashing the tab/browser

Current behavior

We’ve been seeing this issue intermittently in our CI pipeline for some time, but we had a particularly clean, minimal example on Friday, so this report describes what we saw on that occasion.

We observed that one of our parallel runners had stopped emitting any output, suggesting that Cypress was stuck somewhere. Using VNC, we remoted into the display that Cypress uses to see what was going on, and what we saw was the browser spinning trying to load our tests. The browser was fully responsive, and refreshing the tab saw it getting stuck in the same place every time:

[Screenshot: Selection_999(021)]

In the network tab, we could see that it was the XHR request to $BASE_URL/__cypress/tests?p=integration/ci/glean/tasks.spec.ts which was getting stuck. Chrome listed it as ‘Pending’ - I’ve attached some screenshots of what we could see in the network tab itself.

[Screenshots: Selection_999(022), Selection_999(023), Selection_999(024)]

We were also able to reproduce this in the Chrome console by firing off a manual fetch('$BASE_URL/__cypress/tests?p=integration/ci/glean/tasks.spec.ts') - the returned promise never settled. We did this about three times, then fired off one for a different spec file to see what would happen. As soon as we did, the browser crashed completely and we got a JavaScript heap out of memory error in the runner logs - see crashed output.txt.
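For anyone wanting to check for the same symptom, here is a minimal sketch of a probe that tells a request that never settles apart from one that completes or fails. It is not part of Cypress: probeRequest, ProbeResult and FetchLike are hypothetical names, and the timeout is an arbitrary assumption. It races the request against a timer, so it should report a hang even if the underlying fetch ignores the abort signal.

```typescript
// Hypothetical helper (not a Cypress API): classifies a request as
// completed, failed, or hung (no response within timeoutMs).
type ProbeResult =
  | { kind: "completed"; status: number; elapsedMs: number }
  | { kind: "hung"; elapsedMs: number }
  | { kind: "failed"; error: string; elapsedMs: number };

// Minimal shape of the fetch function we rely on, so a fake can be injected.
type FetchLike = (
  url: string,
  init: { signal: AbortSignal }
) => Promise<{ status: number }>;

async function probeRequest(
  fetchImpl: FetchLike,
  url: string,
  timeoutMs: number
): Promise<ProbeResult> {
  const controller = new AbortController();
  const started = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Resolves with a sentinel once the time budget is used up.
  const timeout = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });

  try {
    const outcome = await Promise.race([
      fetchImpl(url, { signal: controller.signal }),
      timeout,
    ]);
    const elapsedMs = Date.now() - started;
    if (outcome === "timeout") {
      // Ask the request to cancel so it does not stay pending forever.
      controller.abort();
      return { kind: "hung", elapsedMs };
    }
    return { kind: "completed", status: outcome.status, elapsedMs };
  } catch (err) {
    // The request settled with an error before the timeout fired.
    return { kind: "failed", error: String(err), elapsedMs: Date.now() - started };
  } finally {
    clearTimeout(timer);
  }
}
```

In a browser console (with the type annotations stripped) this could be called as probeRequest((url, init) => fetch(url, init), '$BASE_URL/__cypress/tests?p=integration/ci/glean/tasks.spec.ts', 10000), substituting the real base URL. A result of kind "hung" would correspond to the ‘Pending’ state we saw in the network tab.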

We’ve been trying to narrow down this issue for some time now, and have held off raising a Cypress issue as we were concerned that it might be a regression in our own app. However, in this instance the problem was occurring before our site was even loaded, leading us to believe it’s a Cypress issue (specifically around the way that requests are proxied).

Desired behavior

No response

Test code to reproduce

Our cypress.json looks as follows, if it’s of any interest:

{
  "integrationFolder": "integration",
  "pluginsFile": "plugins/index.js",
  "screenshotsFolder": "screenshots",
  "videosFolder": "videos",
  "fixturesFolder": "fixtures",
  "supportFile": "support/index.ts",
  "chromeWebSecurity": false,
  "defaultCommandTimeout": 20000,
  "numTestsKeptInMemory": 0,
  "videoUploadOnPasses": false
}

We are running Cypress in headed mode.

Cypress Version

9.5.1

Other

The browser in this case was Edge 100, but we’re seeing the same issues in Chrome as well.

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 3
  • Comments: 84 (48 by maintainers)

Most upvoted comments

Hey, just wanted to share an update on what we’ve been up to the last week or so.

I’ve just merged a PR that simplifies the test repo quite a bit more. There’s now just a single spec, which runs the same test in a loop 300 times. The test is a “failed login flow” - all it does is enter some credentials, hit ‘log in’ and expect an error to be shown.

One upside of this is that there’s no longer any need for any backend services, and so we’ve also added the local config file into source control (because there are no credentials needing to be wired anymore). This should mean that any collaborator can get up and running locally straight away 🎉

I’ve started experimenting with hosting our site in different ways to see if the problem remains or goes away. Currently our app is hosted in a k8s cluster behind traefik - I’m playing with a branch where we host it as a static S3 bucket instead on raw http. So far, I haven’t reproduced the hang on this branch (but would like to do more runs to be sure).

Just to give another update on this, we’re continuing to try and narrow down where the problem might be by gradually simplifying the test suite to see if the hang is still reproducible.

As part of this, we’ve made a public repo which runs the tests via github actions against a static branch that we’ll maintain for this purpose. It’s still reproducing the problem currently, see e.g. this build.

We’re happy to add folks from your side as collaborators if that sounds useful? Thinking it might be really helpful for you to have something to play with, and this way if you think of more diagnostics etc worth adding you could get the feedback more directly rather than having to go via us 😄

It’s been a while since I last gave an update, but I have some good news. We ran another experiment, this time to see what happened if we ran the same minimal scenario but using Playwright instead of Cypress. We rewrote a simple login test and did a bunch of builds with it running in a loop and we’re now very confident that we’ve run into the exact same problem.

It manifests slightly differently, because Playwright detects that their worker has stopped responding and spins up a new one in its place (we see Worker teardown timeout of 30000ms exceeded while tearing down "context". in the logs). The resulting failure videos are either unplayable files or long videos that show our app hanging - we see these exact same symptoms in our regular Cypress builds that get stuck too.

This means the problem must lie elsewhere, and it’s some interaction between our app and the browser that’s the root of the problem. The bad news (for us) is that we still don’t know exactly what, but we’re definitely getting closer. And it means I can close this issue and not waste any more of your time. Thanks for all the help you’ve given us with this, we’ve really appreciated it! 🏆

Ah, I am looking at this docker image and we do install 4.1.0 globally. It seems weird that I was able to get the one with 4.1.0 installed from the lock file to hang, but the one with 4.0.0 didn’t. Either way, we are going to need to dig a bit more into why that is. 35 runs without a hang is good. I’m just wondering why that is now 🤔

Okedoke, I’ve sent those through to the same email as before 😄

Cypress debugging log sent.

@alyssa-glean can you let us know if you find out the root cause between your app and the browser? We have been experiencing the same issues as you

A bit of a frustrating one today. I created a docker image that installs node 12 and 4.0.0 globally. I bumped the job timeout to 45 minutes and STILL had 1-2 hang on me. We did add a few additional flags from 4.0.0-4.1.0 in the docker image but I am starting to wonder if that even might be a red herring and we were just lucky with runs that did not hang.

I am wondering if we next try

  • running without the specified env variables in the docker image
  • downgrade to 3.x

5.0.0 hung on the first run. Downgrading to 4.0.0

Looks like the testing site is throwing a 404? Looks like the branch I am running is also up to date with main. I’ll try rerunning in a few hours and see if it’s back up.

Ah, apologies - it got brought down as part of a “cleanup” job we have. It’s back up and running now 👍

Looks like the hang is replicable on electron, so feel free to use this for your testing (might be easier to get debug info etc): https://github.com/glean-notes/cypress-testing/runs/7742320369?check_suite_focus=true

Yep, I’ve emailed a complete version of the file over now 😄

@alyssa-glean would you be able to add me as a contributor? I’d like to create a branch and try some things in github actions if that is OK with you all!

Apologies - the upgrade to Cypress 10 hasn’t been as straightforward as I’d hoped so it’s taking a little longer. Will hopefully have it rolled out next week and can update you then.

@alyssa-glean hmmmm interesting, the PR I tagged may not be the full solution to what you are experiencing then, however it might alleviate some issues like you said. FYI, 10.3.0 was just released earlier this afternoon.

Brilliant! I’ll get us upgraded today and report back 😄

@alyssa-glean My apologies, I haven’t had the time to pull the logs on this, but this stopped happening for us after updating our version of Chrome. We were running a much older Chrome, 89.0.4389.82, with Cypress v9, and a particular test suite was hanging in this way. We bumped Chrome to 101.0.4951.54 (which was the latest at the time, on May 4th 2022), and that seems to have resolved the issue.

No problem - glad you’ve found something that worked for you! We did a lot of bumping of browser versions when we first hit this, so far to no avail - but we’re a few stops short of 101.x.y.z so perhaps we’ll give that a go.

if possible, could you run Cypress with the debug logs turned on:

Sure, we’ll try this too and get back to you - we have tinkered with the debug logs a bit but not with these exact flags. We didn’t spot anything interesting in them but we’re definitely not the experts 😅