selenium: java.util.concurrent.TimeoutException thrown at random netty read timeouts with RemoteWebDriver
š Bug Report
Netty at random times gets a read timeout at. This happens at different selenium commands ( for example: WebDriver.switchTo().defaultContent, WebElement.click, WebDriver.switchTo().window, WebElement.sendKeys, WebDriver.get, Alert.accept ) and at random in a quite small percentage chance (<1% test cases).
To Reproduce
I donāt have specific steps to reproduce. When our CI runs our test suite of thousands of tests run, about 10 fails at random due to this timeout. I could not reproduce by doing a simple long loop with a few commands on my development workstation.
Timeout details
This timeout always occurs at:
Caused by: java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
at org.asynchttpclient.netty.NettyResponseFuture.get(NettyResponseFuture.java:206)
at org.openqa.selenium.remote.http.netty.NettyHttpHandler.makeCall(NettyHttpHandler.java:65)
I could confirm that it took 3 minutes there, confirming that it is due to the default 3 minutes read timeout the selenium configures the netty with. But the commands that are timing outs would normally run very fast, much less than one second.
Trying the code below in a method called probably thousands times by my test suite, it failed entering the catch. But after it called again driver.switchTo().defaultContent() at the end of the code below it worked. So it seems that although the read timeout happens in netty, it still works normally afterwards.
try
{
driver.switchTo().defaultContent();
}
catch (TimeoutException e)
{
// this should never happen, but started happening at random after updating to selenium 4
// output information to help troubleshoot
System.err.println("TimeoutException thrown while trying to go to defaultContent (stack below). Trying again...");
e.printStackTrace();
try
{
Thread.sleep(5000);
}
catch (InterruptedException e1)
{
}
driver.switchTo().defaultContent();
}
In this case, the stack trace got by the e.printStackTrace()
above was:
org.openqa.selenium.TimeoutException: java.util.concurrent.TimeoutException
Build info: version: '4.0.0-beta-3', revision: '5d108f9a67'
System info: host: '51e5404d333b', ip: '172.18.0.7', os.name: 'Linux', os.arch: 'amd64', os.version: '3.10.0-1127.19.1.el7.x86_64', java.version: '11.0.1'
Driver info: org.openqa.selenium.remote.RemoteWebDriver
Command: [a5e3bf25-ba72-4023-b219-76406cf58660, switchToFrame {id=null}]
Capabilities {acceptInsecureCerts: true, browserName: firefox, browserVersion: 88.0, javascriptEnabled: true, moz:accessibilityChecks: false, moz:buildID: 20210415204500, moz:debuggerAddress: localhost:46562, moz:geckodriverVersion: 0.29.0, moz:headless: false, moz:processID: 9286, moz:profile: /tmp/rust_mozprofileQJRwQP, moz:shutdownTimeout: 60000, moz:useNonSpecCompliantPointerOrigin: false, moz:webdriverClick: true, pageLoadStrategy: normal, platform: LINUX, platformName: LINUX, platformVersion: 3.10.0-1127.19.1.el7.x86_64, rotatable: false, se:cdp: ws://172.18.0.3:4444/sessio..., se:cdpVersion: 85, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify}
Session ID: a5e3bf25-ba72-4023-b219-76406cf58660
at org.openqa.selenium.remote.http.netty.NettyHttpHandler.makeCall(NettyHttpHandler.java:71)
at org.openqa.selenium.remote.http.AddSeleniumUserAgent.lambda$apply$0(AddSeleniumUserAgent.java:42)
at org.openqa.selenium.remote.http.Filter.lambda$andFinally$1(Filter.java:56)
at org.openqa.selenium.remote.http.netty.NettyHttpHandler.execute(NettyHttpHandler.java:51)
at org.openqa.selenium.remote.http.AddSeleniumUserAgent.lambda$apply$0(AddSeleniumUserAgent.java:42)
at org.openqa.selenium.remote.http.Filter.lambda$andFinally$1(Filter.java:56)
at org.openqa.selenium.remote.http.netty.NettyClient.execute(NettyClient.java:103)
at org.openqa.selenium.remote.tracing.TracedHttpClient.execute(TracedHttpClient.java:55)
at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:181)
at org.openqa.selenium.remote.TracedCommandExecutor.execute(TracedCommandExecutor.java:39)
at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:619)
at org.openqa.selenium.remote.RemoteWebDriver$RemoteTargetLocator.defaultContent(RemoteWebDriver.java:1097)
(...)
Caused by: java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
at org.asynchttpclient.netty.NettyResponseFuture.get(NettyResponseFuture.java:206)
at org.openqa.selenium.remote.http.netty.NettyHttpHandler.makeCall(NettyHttpHandler.java:65)
... 38 more
Environment
OS: Docker containers inside a CentOS Browser: RemoteWebDriver using Firefox in selenium/standalone-firefox:4.0.0-beta-3-20210426 docker image. Also tried the selenium/standalone-firefox:4.0.0-beta-4-prerelease-20210527 docker image, but the same thing happened. Browser Driver version: RemoteWebDriver from selenium-java 4.0.0-beta-3 Language Bindings version: Java 4.0.0-beta-3 The RemoteWebDriver runs in a container that is running in the same docker host as the browser container. So all network between them is only logical in the same machine. Previously we were using Selenium 2.52, in the same docker host, and never happened anything similar to such timeout.
Do you have any tips about what I can try to fix it or investigate more about this?
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 18
- Comments: 165 (50 by maintainers)
Commits related to this issue
- [http-client] Allow AsyncHttpClient to manage the lifecycle of the timer. Related to #9528. — committed to SeleniumHQ/selenium by pujagani 3 years ago
- [grid] removing `content-length` header for `GET` requests This helps https://github.com/SeleniumHQ/selenium/issues/9528 because Netty tends to timeout when a GET has a `content-length` header (which... — committed to SeleniumHQ/selenium by diemol 3 years ago
- [grid] Enabling request retries for session handling Helps with #9528 — committed to SeleniumHQ/selenium by diemol 2 years ago
- [java] Deactivating automatic retries in the HTTPClient Helps with #9528 — committed to SeleniumHQ/selenium by diemol 2 years ago
- [grid] Enabling request retries for session handling Helps with #9528 — committed to elgatov/selenium by diemol 2 years ago
- [java] Deactivating automatic retries in the HTTPClient Helps with #9528 — committed to elgatov/selenium by diemol 2 years ago
- Retry failures related to https://github.com/SeleniumHQ/selenium/issues/9528 also in cleanup methods — committed to geb/geb by erdi 2 years ago
Thank you @Cybermaxke for the workaround. This has been included already by @pujagani through #10220.
We are planning a 4.1.2 release for next week (which includes this fix).
@JulienBreton I was checking the attached log and I am confused about how the test is sending those commands. I followed the command for session
f24ae2d1ba0f0d398cdaa093ef912389
ā¦POST '/session/f24ae2d1ba0f0d398cdaa093ef912389/timeouts'
which does not seem to get a response.POST /session/f24ae2d1ba0f0d398cdaa093ef912389/timeouts HTTP/1.1
which does not seem to get a response.DELETE /session/f24ae2d1ba0f0d398cdaa093ef912389 HTTP/1.1
which gets completed at 12:48:04.062DELETE /session/f24ae2d1ba0f0d398cdaa093ef912389 HTTP/1.1
which gets completed at 12:48:03.995.But thenā¦
http://172.18.0.4:5555/session/f24ae2d1ba0f0d398cdaa093ef912389/timeouts
.POST /session/f24ae2d1ba0f0d398cdaa093ef912389/timeouts HTTP/1.1
It seems our HTTP client is doing some extra retries we were not aware of (
NettyRequestSender.retry
) which I will deactivate in a moment. In this case, those automatic retries are happening after the session has been deleted. And yes, we need to switch to another HTTP client soon.Deactivating those automatic retries should reduce or remove the amount of timeouts happening, and the let us rely on the configured
session-timeout
plus the retries that only happen between Hub and Node. I think we will do a new release in less than 1 week.We are making some changes to the retry mechanisms since it might be hiding some issues. This will be part of 4.1.3
Weāre also seeing 5-10% failures due to this issue.
Hereās @Cybermaxkeās spectacular workaround verbosely written in Java. Would really prefer this to just be configurable or fixed.
@diemol with the 4.6.0 I donāt face issues anymore. The new Java 11 HTTP client and the fix for #11156 make Selenium 4 stable (in my case at least). Thanks a lot !
@diemol I can reproduce the java.util.concurrent.TimeoutException with the 4.2.0
I add a file with more logs. selenium_4.2.0_logs_3005.txt
Thank you for the feedback. I will close this issue and if something related comes up, please open a new one so we can address any variant individually.
@JulienBreton @evertones In 4.1.3 we removed the behavior of having retries everywhere by default. So it is not strange to see the error again, but the advantage now is that with your stacktraces we see where the timeout is exactly happening.
Before I go and apply a fix, would it be possible to get more detailed logs? (with
--log-level "fine"
)Weāre also seeing the same issue, when running about 700 tests, 20 will randomly fail. We have a grid setup (version 4.1.0) with 10 chrome nodes on AKS.
As a workaround, I was looking into a retry mechanism which caused me to stumble upon
RetryRequest#readTimeoutPolicy
, which never gets to retry due to its max duration. So I adjusted the read timeout of the remote driver and maximum duration policy to trigger some retries. Up to now, no action has failed twice in a row due to this timeout.The workaround if anyone is interested:
Do you know if there is a workaround for this issue ? Selenium 4 mades the tests flaky, we canāt use it.
Do we have anything in our docs about it, though? Blog post is good for explaining *why, but should be a straightforward what in the docs?
@JulienBreton As you told I am not getting java.util.concurrent.TimeoutException anymore when using Java11+http client instead of async http client. But getting this org.openqa.selenium.TimeoutException: java.net.http.HttpTimeoutException: request timed out Same exact scenes like how random java.util.concurrent.TimeoutException occurs, similarly now I am seeing this org.openqa.selenium.TimeoutException: java.net.http.HttpTimeoutException: request timed out exception.
This is the log I am seeing when testcase fails
@diemol @titusfortner please throw some light on why this exception occurs Thanks!
Does anyone have a port of this workaround for python? I was hoping for a fix or a feature to enable this in the core codebase.
@diemol Any update on the next release which contains the above fix?
Hi, I canāt reproduce the issue with the
java.util.concurrent.TimeoutException
. My tests work well now. Thanks for your hard work and this awesome tool !Do we have any update on this issue? We are facing this issue in our project and we are stuck for a long time.
@manu141 they do not help because you are using an old version of Grid.
@JulienBreton thatās right ā¦even we are able to reproduce @diemol the same issue on IE Edge mode in selenium grid-HUB and NODE configuration and itās a huge blocker for us. As we run about 3000 tests during our regression time on and currently we went back to IE in some cases. We did try to limit nodes from 10 to 3 we saw some improvement but it happen at least on one node in 3 and tests fail on that nodeā¦no alternate solution we had to restart that test and node.
@amyreit in the 4.1.2 there is a temporary fix but this fix is removed since 4.1.3 for a better way to solve the issue. The issue occurs always in 4.1.4, then it seems the team needs to add a fix (maybe it is already the case in the trunk).
Just another piece of information - I moved from 4.0 to 4.1.2 standalone firefox in early April and the occurrence of the issue has dropped significantly. From several runs a day to once every week.
I can see the same exception mentioned by @JulienBreton. I am using:
4.1.3-20220327
Logs in the
hub
Logs in the
node
(node-chrome
)Above I found really strange the messages in the last 2 lines to be messed up. It looks like a concurrency issue printing them out.
Hi @diemol I am sorry to say that but with 4.1.3 the issue with the
java.util.concurrent.TimeoutException
is back. I have run many times my tests with 4.1.2 and I have not encountered the problem. With 4.1.3 I have the problem several times.My config : Selenium Java 4.1.3 docker-selenium : 4.1.3-20220327
Awesome, thank you @diemol , any rough time line for 4.1.3 release?
import org.openqa.selenium.remote.http.RetryRequest;
@JulienBreton , in my case I use the Java Selenium client. The workaround I found for my case is to use the Selenium 4 client (I am still using a beta, I did not upgrade to the final yet, but I guess it will work too) connecting via RemoteWebDriver to the Selenium Server version 3 docker (example:
selenium/standalone-firefox:3.141.59-20210422
).Seems we could use this repo to reproduce the issue. https://github.com/kasperSiteimprove/GridTimeoutRepro
I can see it every time I run the suite of tests that I have. Out of ~320 specs, usually there are random 32 or 33 specs that fail with the
org.openqa.selenium.TimeoutException
. Every time I run, different specs fail with this exception.Environment
Hub/Nodes
in a Linux server with 88 cores using KubernetesSelenium 4.0
(final release)Logs on the Hub
Logs on the Node