testcafe: Unstable error "Unable to establish one or more of the specified browser connections" after Chrome update (v83)

What is the Test Scenario?

TestCafe tests run in parallel against headless Chrome in TeamCity.

What is the Current behavior?

After the Chrome update to v83, TestCafe tests intermittently (about half the time) fail to start with the following error:

GeneralError: Unable to establish one or more of the specified browser connections. This can be caused by network issues or remote device failure.
      at BrowserSet._waitConnectionsOpened (E:\BuildAgent\work\6d1660a0b0f4fce5\testcafe\node_modules\testcafe\src\runner\browser-set.ts:91:30)
      at E:\BuildAgent\work\6d1660a0b0f4fce5\testcafe\node_modules\testcafe\src\runner\browser-set.ts:114:35
      at processTicksAndRejections (internal/process/task_queues.js:94:5)
      at Bootstrapper._getBrowserConnections (E:\BuildAgent\work\6d1660a0b0f4fce5\testcafe\node_modules\testcafe\src\runner\bootstrapper.ts:215:16)
      at async Promise.all (index 0)
      at Bootstrapper._bootstrapParallel (E:\BuildAgent\work\6d1660a0b0f4fce5\testcafe\node_modules\testcafe\src\runner\bootstrapper.ts:391:38)
      at Bootstrapper.createRunnableConfiguration (E:\BuildAgent\work\6d1660a0b0f4fce5\testcafe\node_modules\testcafe\src\runner\bootstrapper.ts:424:42) {
    code: 'E1004',
    data: []
  }
  npm ERR! code ELIFECYCLE
  npm ERR! errno 1
  npm ERR! test.testcafe@1.0.0 test:teamcity
  npm ERR! Exit status 1
  npm ERR! 
  npm ERR! Failed at the se.test.testcafe@1.0.0 test:teamcity script.
  npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
  npm ERR! A complete log of this run can be found in:
  npm ERR!     C:\Users\TC_BuildService\AppData\Roaming\npm-cache\_logs\2020-06-16T13_04_02_492Z-debug.log
  Process exited with code 1
  Process exited with code 1 (Step: Tests (Command Line))
  Tests (Command Line) failed

Before the update, these errors were very rare.

It happens not only in TeamCity but also in local runs (less often).

Using chrome:headless:userProfile and/or chrome:headless --no-sandbox didn’t help.

What is the web application and TestCafe test code?

Parameters:

  • target browser: chrome:headless
  • concurrency level: 6
  • hostname: localhost
  • port1: 1337
  • port2: 1338
  • skipJsErrors: false
  • skipUncaughtErrors: true

TestCafe runner code:

const createTestCafe = require('testcafe');
const path = require('path');
const config = require('./.testcafe.config');

let testcafe = null;
createTestCafe(config.hostname, config.port1, config.port2)
    .then(tc => {
        testcafe = tc;

        const runner = testcafe
            .createRunner()
            .browsers(config.browser)
            .src(config.src)
            .concurrency(config.concurrency)
            .reporter(config.reporter);

        return runner.run({
            ...config.runnerOptions,
            quarantineMode: true
        });
    })
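    .then(async function(failedCount) {
        // Wait for TestCafe to shut down before exiting the process.
        await testcafe.close();
        process.exit(failedCount ? 1 : 0);
    })
    .catch(async function(error) {
        console.error(error);
        // createTestCafe itself may have failed, leaving testcafe unset.
        if (testcafe) {
            await testcafe.close();
        }
        process.exit(1);
    });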

Custom configuration file:

{
    hostname: 'localhost',
    port1: 1337,
    port2: 1338,
    browser: 'chrome:headless',
    src: './tests/*.js',
    concurrency: 6,
    reporter: 'teamcity',

    screenshots: {
        fullPage: false,
        takeOnFails: false
    },

    runnerOptions: {
        skipJsErrors: false,
        skipUncaughtErrors: true,
        pageLoadTimeout: 15000,
        selectorTimeout: 6000,
        assertionTimeout: 6000
    }
}

Environment details:

  • testcafe version: 1.8.6
  • node.js version: 12.14.1
  • browser name and version: Chrome 83.0.4103.106 / Windows 10
  • platform and version: Microsoft Windows Server 2016 Standard
  • TeamCity: 2019.2.4 (build 72059)
  • testcafe-reporter-teamcity: 1.0.10

Comments

I’m sorry I can’t provide a public link or a stable repro.

Please tell me which parameters I should pay attention to and which test configuration I should try. Could it really be related to the latest Chrome update?

If you need additional info I’m happy to provide it.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 21 (8 by maintainers)

Most upvoted comments

I’ve identified the root cause in my case. If CPU usage on my machine exceeds 70-80% this error occurs. In case of lower CPU usage TestCafe works as expected.
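
A rough way to check for this condition before a run is to sample system-wide CPU usage. The sketch below uses only Node's os module; the helper names (snapshotCpuTimes, usageBetween, sampleCpuUsage) and the one-second sampling window are assumptions for illustration and are not part of TestCafe.

```javascript
const os = require('os');

// Sum idle and total CPU time (in ms) across all cores at this instant.
function snapshotCpuTimes() {
    let idle = 0;
    let total = 0;
    for (const cpu of os.cpus()) {
        for (const value of Object.values(cpu.times)) {
            total += value;
        }
        idle += cpu.times.idle;
    }
    return { idle, total };
}

// Fraction of CPU time (0..1) spent non-idle between two snapshots.
function usageBetween(start, end) {
    const totalDelta = end.total - start.total;
    const idleDelta = end.idle - start.idle;
    if (totalDelta <= 0) {
        return 0;
    }
    return Math.min(1, Math.max(0, 1 - idleDelta / totalDelta));
}

// Resolve with the CPU usage observed over the sampling window.
function sampleCpuUsage(sampleMs = 1000) {
    const start = snapshotCpuTimes();
    return new Promise(resolve => {
        setTimeout(() => resolve(usageBetween(start, snapshotCpuTimes())), sampleMs);
    });
}
```

A runner script could call sampleCpuUsage() first and log a warning, or lower the concurrency, when the result is above roughly 0.7. That threshold comes from the observation above and may not hold on other machines.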

Hello @bryg217,

I’m glad this is helpful. Further feedback on whether the updates will solve the problem with stability would be much appreciated.

Hello, @bryg217,

Quarantine mode doesn’t retry the connection if it isn’t established within the timeout. This mode is designed for retrying unstable tests; it won’t help with an unstable browser connection.

In the following PR, we changed the error message so that it provides more information: Improve the message shown by “Unable to establish connection” error. In the same PR, we introduced the ability to specify the timeout within which a browser connection should be established. The changes from the PR are not yet released or reflected in our documentation, but you can already test them and see if increasing the default timeout with the --browser-init-timeout flag resolves the stability problem. You can install the build by running the following command:

npm install https://git.io/JLQ2k
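
Because quarantine mode does not retry a failed browser connection, one possible stopgap is to retry the whole run from the calling script when the error carries the code E1004 (the code shown in the log of the original report). This is only a sketch of a workaround, not a TestCafe feature: runWithConnectionRetry and its runOnce callback are hypothetical names, and it is an assumption that each attempt can safely start a fresh run.

```javascript
// Error code taken from the log in the original report.
const CONNECTION_ERROR_CODE = 'E1004';

// Call runOnce until it succeeds or maxAttempts is reached.
// Only browser-connection errors are retried; anything else is rethrown at once.
async function runWithConnectionRetry(runOnce, maxAttempts = 3) {
    let lastError = new Error('runOnce was never called');
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await runOnce(attempt);
        } catch (error) {
            if (!error || error.code !== CONNECTION_ERROR_CODE) {
                throw error;
            }
            lastError = error;
            console.warn(`Browser connection failed (attempt ${attempt} of ${maxAttempts})`);
        }
    }
    throw lastError;
}
```

Here runOnce would be a function that creates the runner and returns the promise from runner.run(...). Retrying hides the symptom without explaining it, so it is best paired with logging of how often a retry was needed.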

Thank you for the provided information. While your tests can pass sometimes even when CPU, RAM or disk is overloaded, the stability of your test system is very unpredictable. You can try to use the minimal concurrency level (1) and see if it helps.

It is not a problem of CPU or RAM in my case. The issue also happens when the machine is not overloaded. Also, I don’t want to execute tests with the minimal concurrency because I would like my tests to run in a reasonable time. Even if it worked, it’s just a workaround, not the real answer to the problem.

Hi. Same for me. I constantly face the “Unable to establish one or more of the specified browser connections. This can be caused by network issues or remote device failure” error. Please help me find a workaround.

Even with DEBUG=hammerhead*,testcafe* and with aggressively debug() enhanced source, we were not able to discern why chrome is not processing the IDLE_PAGE content. If any chrome wizards lurk in here, we attempted to use the TestCafe (TC) “custom” testcafe browser, multiplex the chromium output, capture chrome.log & stdio from the foreground chrome process, but it didn’t prove super useful because we were unable to get log content from tabs/windows: https://stackoverflow.com/questions/66926607/how-can-i-run-chromium-in-the-foreground-and-capture-native-and-webview-logs . If we can reliably get tab level, network level, and window level events from the chromium cli & emitted logs, such information could perhaps improve robustness in the TC workflow.

The TestCafe initialization process can improve robustness, now, by changing the synchronization mechanisms executed between the browser process & node runner.

Current process:

  • TC launches chrome, pointing to a page
  • Chrome loads page
  • Page does I/O w/ TC server
  • Page processes init scripts, reporting back to server on each completion via POST
  • POST handler allows test execution to resume

A more robust approach would be:

  • TC launches chrome
  • TC immediately gets a CDP handle
  • TC injects serialized init scripts into chrome, using request/response control flow

Currently, various effects are executed & coordinated by careful alignment of assets. Success is achieved by optimistically expecting that each downstream, uncontrolled effect executes successfully. It is dangerous to pass ownership of initialization control flow to chrome, and chrome to the embedded webapp, as TC does not have hooks into either of these systems by the time runInitScripts is called. TC launches chrome, passes a URL, & 🤞 wishes both chrome and the downstream web-app the best of luck. What if TC managed the whole init process, vs implicitly marshaling that responsibility to these other (generally reliable, but currently failing) entities?

Current process (pseudo):

let testCafeWindowIsReadyDeferred = ... <actually implemented via initScriptsQueue>
TC (system 1) creates readiness deferred
TC launches Chrome (system 2)
Chrome tries to load a web app (system 3) 
Web-app attempts to _do lots of work_ in the window, which hopefully eventually lands a POST call into system 1, which settles the deferred
READY

Possible future:

TC (system 1) launches chrome (system 2)
TC navigates chrome to idle page (resolve/reject)
TC injects script, gets response (resolve/reject)
...loop, until all init scripts settled
READY
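
The "possible future" loop above could look roughly like the following sketch. It is not TestCafe code: injectScript is a hypothetical stand-in for whatever mechanism would send one script to the browser and return a promise for its result, and the per-script timeout is an assumption.

```javascript
// Reject if the given promise does not settle within timeoutMs.
function withTimeout(promise, timeoutMs) {
    let timer;
    const timeout = new Promise((resolve, reject) => {
        timer = setTimeout(
            () => reject(new Error(`Init script timed out after ${timeoutMs} ms`)),
            timeoutMs
        );
    });
    return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Inject init scripts one at a time; each must resolve or reject before the next starts.
// The returned promise settles only when every script has settled, or the first one fails.
async function runInitScriptsSequentially(initScripts, injectScript, timeoutMs = 5000) {
    const results = [];
    for (const script of initScripts) {
        results.push(await withTimeout(injectScript(script), timeoutMs));
    }
    return results;
}
```

The point of this shape is that the runner owns every step: a failure or timeout surfaces as a rejection tied to a specific script, instead of a silent wait for a POST that never arrives.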

We are finding this error on the daily in chrome as well. We’ve gone nuts and added debug(…) statements everywhere 😃

  • _onIdle issues HTML to chrome
  • 45-60s later, the browser closes

The HTML idle page document either doesn’t make it to chrome (unlikely), or the on-page JavaScript is periodically failing, which prevents TC from bootstrapping itself.

I’d like to find a way to capture the local chrome output emitted by localChrome.start(…). Does anyone know if this is feasible?

I think we have found the reason.

@AndreyBelym

We were moving TestCafe to run on Google Cloud Run, and we always used a concurrency of 1 (just one browser at a time). It was strange to us that even one browser was failing with this error.

We have tried:

  • Closing the browser after each run and cleaning the env;
  • Running TestCafe in a separate process from our main runner;
  • Killing the Google Cloud Run container after each test and using a clean container;
  • Updating TestCafe and using browserInitTimeout;
  • Creating a new Dockerfile on Debian with a newer chrome and better logging;

Nothing helped.

But… after a bit of googling, we found that Puppeteer has a similar issue. Googling for a Puppeteer solution turned up nothing.

However, our engineer started writing down everything that differed between good and bad starts, and he found that the ports were consistently the same in failed runs.

So we have hardcoded the port like this:

return await createTestCafe('localhost',8081,8082)

After hardcoding it like that - everything works perfectly.

Maybe you could add a feature that checks whether a port is available before trying to start a process on it?
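
Until something like that exists, a check of this kind can be done in the calling script. The sketch below uses Node's net module to try binding a port briefly; isPortFree is a hypothetical helper name, not a TestCafe API.

```javascript
const net = require('net');

// Resolve to true if a server can bind to the port on the given host, false if it is taken.
function isPortFree(port, host = 'localhost') {
    return new Promise((resolve, reject) => {
        const probe = net.createServer();
        probe.once('error', error => {
            // EADDRINUSE: another process holds the port. EACCES: binding is not permitted.
            if (error.code === 'EADDRINUSE' || error.code === 'EACCES') {
                resolve(false);
            } else {
                reject(error);
            }
        });
        probe.once('listening', () => {
            // Release the port again before reporting it as free.
            probe.close(() => resolve(true));
        });
        probe.listen(port, host);
    });
}
```

A script could call isPortFree(8081) and isPortFree(8082) before createTestCafe('localhost', 8081, 8082) and fail early with a clear message if either returns false. Another process can still grab the port between the check and the real bind, so this narrows the window and does not close it.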