testcontainers-java: Intermittent couchbase failures in CI
For a long time we’ve been getting intermittent failures in CI, such as this:
shouldInsertDocument - org.testcontainers.couchbase.Couchbase4_6Test
com.couchbase.client.core.RequestCancelledException: Request cancelled in-flight.
at com.couchbase.client.core.endpoint.AbstractGenericHandler.handleOutstandingOperations(AbstractGenericHandler.java:686)
at com.couchbase.client.core.endpoint.AbstractGenericHandler.handlerRemoved(AbstractGenericHandler.java:667)
at com.couchbase.client.deps.io.netty.channel.DefaultChannelPipeline.callHandlerRemoved0(DefaultChannelPipeline.java:626)
at com.couchbase.client.deps.io.netty.channel.DefaultChannelPipeline.destroyDown(DefaultChannelPipeline.java:878)
at com.couchbase.client.deps.io.netty.channel.DefaultChannelPipeline.destroyUp(DefaultChannelPipeline.java:844)
at com.couchbase.client.deps.io.netty.channel.DefaultChannelPipeline.destroy(DefaultChannelPipeline.java:836)
at com.couchbase.client.deps.io.netty.channel.DefaultChannelPipeline.access$700(DefaultChannelPipeline.java:44)
at com.couchbase.client.deps.io.netty.channel.DefaultChannelPipeline$HeadContext.channelUnregistered(DefaultChannelPipeline.java:1286)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:176)
at com.couchbase.client.deps.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:162)
at com.couchbase.client.deps.io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:821)
at com.couchbase.client.deps.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:776)
at com.couchbase.client.deps.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
at com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:464)
at com.couchbase.client.deps.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at com.couchbase.client.deps.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Unfortunately this is a nuisance for us and for contributors. I’ve tried suggestions such as this but it hasn’t improved reliability.
Can anyone with more Couchbase expertise help investigate?
To reproduce, run the test suites under modules/couchbase repeatedly.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 3
- Comments: 43 (36 by maintainers)
Commits related to this issue
- Wait before CouchbaseContainer setup is run to reduce risk of apparent partial configuration/race conditions Also add trace level logging, used in diagnosis and possibly useful in future Possible fi... — committed to testcontainers/testcontainers-java by rnorth 5 years ago
- Stabilize CouchbaseContainer by merging the Socat command (#2081) Fixes #1453 — committed to testcontainers/testcontainers-java by bsideup 5 years ago
- change cron of Couchbase to run every 30 mins for testing This changes the cron of Couchbase for testing as discussed in; https://github.com/testcontainers/testcontainers-java/issues/1453#issuecommen... — committed to SudharakaP/jhipster-daily-builds by SudharakaP 4 years ago
Couchbase employee here, we will investigate this issue and come back to you as soon as possible.
@rnorth I’ve been experimenting with a container that does not use socat and I think I’ve got somewhere. I’m not fully done with it yet, but everything so far seems to work fine. I’ve got it working in a standalone lib right now, but as soon as I get my POC completely working I’ll open a PR to this repository - if you are curious: https://gist.github.com/daschl/e32e05e6abc31e450f67c23fe30c3826#file-couchbasecontainer-java
Compared to the current version:
I hope I can make more progress in December and will report back.
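To make this concrete, here is a very rough sketch of the direction described above - exposing the server’s ports directly instead of proxying them through socat, and gating startup on the management REST endpoint. The class name, image tag, port list and wait strategy are illustrative assumptions, not the contents of the gist:

```java
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.wait.strategy.Wait;

// Hypothetical sketch only: the real POC linked above also has to initialise the cluster
// and make the mapped ports reachable for the SDK, which is omitted here.
public class PlainCouchbaseContainer extends GenericContainer<PlainCouchbaseContainer> {

    public PlainCouchbaseContainer() {
        super("couchbase/server:6.0.1"); // placeholder image tag
        // management, query and KV ports, mapped to random host ports by Testcontainers
        withExposedPorts(8091, 8093, 11210);
        // wait until the management REST API answers before any further setup runs
        // (assumes /pools responds with 200 without authentication)
        waitingFor(Wait.forHttp("/pools").forPort(8091).forStatusCode(200));
    }
}
```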
~I might have a fix for this in #2076. This involved an incredibly painful amount of trial and error 😭.~
Sorry, I was wrong - #2076 might improve some aspects of resilience, but there are still numerous failures when running on slower machines.
@SudharakaP I’ve been trying to spot the error in the log you pointed out, but I could only see build warnings - can you point me to the test failure with the stack trace/logs? Then I can take a look 😃 Thanks!
@SudharakaP looking at the logs, I do not see any actual errors (i.e. https://github.com/hipster-labs/jhipster-daily-builds/runs/600774837?check_suite_focus=true) … I only see
Is that suggesting it times out on the wait?
Edit: oh, one other run had an actual error in it:
@SudharakaP : sure, feel free to change the cron and merge it 😃
@SudharakaP thanks for updating! Do you have more logs of such failures? /cc @daschl
Update: it seems that https://github.com/testcontainers/testcontainers-java/pull/2106 stabilized our tests and now @Flaky works as intended.

@daschl thanks so much! I’ve been having another unsuccessful attempt to fix things this afternoon, and was just about ready to throw in the towel. Your message is extremely good timing!
For what it’s worth, the current implementation has numerous places that deserve improvement - polling for creation of buckets, proper handling of non-200 HTTP response codes, longer connect timeouts, etc. (a rough sketch of the first two follows this comment). The list is long. No matter how many patches I’ve made, though, there are still an extraordinary number of different kinds of failures.
This is what makes me very worried about our chances of incrementally improving the reliability of this module, so a more drastic rework seems sensible.
I’ll really look forward to your next updates - thanks for your efforts.
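To make the “deserves improvement” list above more concrete, here is a minimal, illustrative sketch of defensive bucket creation against the Couchbase REST API: check the HTTP status instead of assuming success, use a longer connect timeout, and retry a few times. The endpoint and form fields follow Couchbase’s documented bucket API; the class name, credentials and retry counts are made up for illustration and are not the module’s actual code:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BucketCreator {

    public static void createBucket(String host, int port, String user, String pass, String bucket)
            throws Exception {
        String auth = Base64.getEncoder()
                .encodeToString((user + ":" + pass).getBytes(StandardCharsets.UTF_8));
        String body = "name=" + bucket + "&ramQuotaMB=100&bucketType=couchbase";

        for (int attempt = 1; attempt <= 5; attempt++) {
            HttpURLConnection conn = (HttpURLConnection)
                    new URL("http://" + host + ":" + port + "/pools/default/buckets").openConnection();
            conn.setConnectTimeout(30_000);           // longer connect timeout, per the list above
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Authorization", "Basic " + auth);
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }

            int code = conn.getResponseCode();
            if (code / 100 == 2) {
                return;                               // accepted by the server
            }
            Thread.sleep(2_000);                      // back off and retry on non-2xx
        }
        throw new IllegalStateException("Bucket creation did not succeed after retries");
    }
}
```

Note that even a 2xx here only means the request was accepted; actually waiting for the bucket to exist is a separate step (see the polling sketch further down).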
😬 reopening - we still don’t have a fix yet.
I’ve been running a fairly stupid series of scripted tests to try and at least narrow down where the race is occurring, injecting sleep calls into various locations to see if it reduces the failure rate (a sketch of this kind of hack follows this comment). So far (400+ runs exercising 14 different sleep locations, plus @bsideup’s change), none has been flawless. The most promising, a 10s sleep just before bucket creation (containerIsStarted), reduced the failure rate to a mere 4%, but it’s obviously not a fix either. We seem to have a concurrency bug that we’re not seeing, or more than one bug conspiring together.

I’d be happy to share the 100s of MBs of logs I’ve gathered during this exercise 😂. One of the more pertinent sets of logs, from the most promising test case, is below:
At this point I’m afraid I’m at a bit of a loss - we’re obviously missing something vital, and we’re using up a lot of time and energy to make little progress!
I’d be grateful for any and all suggestions, no matter how radical!
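For context, the sleep-injection experiment described above amounts to something like the following (a diagnostic hack, not a fix). It assumes the module’s container class keeps GenericContainer’s containerIsStarted hook overridable and has a no-arg constructor; both are assumptions for illustration:

```java
import com.github.dockerjava.api.command.InspectContainerResponse;
import org.testcontainers.couchbase.CouchbaseContainer;

// Diagnostic-only subclass: delay before the parent's post-start logic (bucket creation
// and friends) runs, mirroring the 10s sleep that reduced the failure rate to ~4%.
public class SleepingCouchbaseContainer extends CouchbaseContainer {

    @Override
    protected void containerIsStarted(InspectContainerResponse containerInfo) {
        try {
            Thread.sleep(10_000); // give the server extra settling time before setup
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        super.containerIsStarted(containerInfo);
    }
}
```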
FYI I submitted #2081; it seems that there was a faulty execStart that did not wait for completion.

While working on it, I also discovered that if we remove execStart (which means that the ports will not be proxied), the client fails with “request cancelled in-flight”. @daschl I have a feeling that the error message is wrong, because if the destination port cannot be opened it should clearly say so, since there wasn’t any request to send.

One of the important aspects is that the Couchbase Server HTTP API is inherently asynchronous, so even getting a response doesn’t mean the result is immediately available. I wonder if it makes sense to define the explicit endpoints / “stages” the container code triggers; then I can go hunting for the right polling logic to check whether it is actually in a stage where we can move on and declare it “done”.
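A minimal sketch of the “stages” polling idea: since the management API is asynchronous, a 2xx on a create call only means the request was accepted, so the container code would poll the resource until it actually appears. The endpoint and the bare 200 check are assumptions; real readiness may also need to inspect the returned JSON (node and bucket status fields):

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BucketReadiness {

    public static void awaitBucket(String host, int port, String user, String pass,
                                   String bucket, long timeoutMillis) throws Exception {
        String auth = Base64.getEncoder()
                .encodeToString((user + ":" + pass).getBytes(StandardCharsets.UTF_8));
        long deadline = System.currentTimeMillis() + timeoutMillis;

        while (System.currentTimeMillis() < deadline) {
            HttpURLConnection conn = (HttpURLConnection)
                    new URL("http://" + host + ":" + port + "/pools/default/buckets/" + bucket)
                            .openConnection();
            conn.setRequestProperty("Authorization", "Basic " + auth);
            if (conn.getResponseCode() == 200) {
                return;                 // the bucket is at least visible to the REST API
            }
            Thread.sleep(1_000);        // not ready yet - poll again
        }
        throw new IllegalStateException("Bucket " + bucket + " did not become visible in time");
    }
}
```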
With the couchbase-resilience branch on an old machine, I’m getting the following logs (long snippet): https://gist.github.com/rnorth/f61637b4be934d3745ecb35d28961f5b

It’s interesting that these failures occur while waiting for a (supposedly created) bucket to appear. This failure looks different to the other failures we’re seeing on CI.
@rnorth one thing I was thinking: since it happens for some % of runs, maybe it would help if we increase the poll wait time from one minute? It might at least be worth a try.
The request cancellations are very likely a side effect of another problem: if the network is healthy, then I suspect we are connecting to the server port before it’s ready (which also points towards polling longer until it comes online).
Are you able to reproduce this locally too or do you only see it in CI?
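For the “poll longer than one minute” suggestion, one low-effort experiment (a sketch, assuming the container uses an HTTP wait on the management port) is to hand the container a wait strategy with a larger startup timeout than the usual one-minute default:

```java
import java.time.Duration;
import org.testcontainers.containers.wait.strategy.Wait;
import org.testcontainers.containers.wait.strategy.WaitStrategy;

public class LongerWaitExample {

    // Poll the management endpoint, but allow three minutes for it to come up.
    // Path, port and credentials are guesses, not the module's actual configuration.
    public static WaitStrategy couchbaseWaitWithLongerTimeout() {
        return Wait.forHttp("/pools")
                .forPort(8091)
                .withBasicCredentials("Administrator", "password")
                .withStartupTimeout(Duration.ofMinutes(3));
    }
}
```

The resulting strategy would be passed to the container via waitingFor(...).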
We get a mix of failures, but com.couchbase.client.core.RequestCancelledException: Request cancelled in-flight is by far the most prevalent. If we could only fix one issue, it would be this one.

It occurs randomly for some % of CI builds and local builds, but unlike other random failures it is not helped at all by retries within a build (we have a JUnit rule to retry specific failed tests that we think are ‘flaky’).
This makes me think that there’s some statefulness at play that’s causing one failure to cascade and pollute subsequent tests. I was unable to work out why when I investigated this back in May.
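For reference, the retry rule mentioned above is presumably something along these lines (a generic JUnit 4 sketch; the class name and retry count are invented). The notable point is that even this kind of retry does not help with the RequestCancelledException failures, which is what makes them look stateful rather than merely flaky:

```java
import org.junit.rules.TestRule;
import org.junit.runner.Description;
import org.junit.runners.model.Statement;

public class RetryRule implements TestRule {

    private final int maxAttempts; // should be >= 1

    public RetryRule(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    @Override
    public Statement apply(Statement base, Description description) {
        return new Statement() {
            @Override
            public void evaluate() throws Throwable {
                Throwable last = null;
                for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                    try {
                        base.evaluate();      // run the test body
                        return;               // passed - stop retrying
                    } catch (Throwable t) {
                        last = t;             // remember the failure and try again
                    }
                }
                throw last;                   // every attempt failed
            }
        };
    }
}
```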
Trying to triage this a bit. Am I correct that the initial exception reported in this ticket is different from the failing runs linked? In the linked run I see:
while this ticket reports
The first one looks like the server is not coming up in time before the wait gives up; the second one is completely different: the SDK sent a request into the cluster, but while the request was in flight on the socket, the socket got closed, and as a result the client has no option other than to cancel the request.
Some additional information, in case it helps: in the JHipster project, we encounter a similar issue to the one mentioned by @SudharakaP.
In our CI, it works perfectly well with OpenJDK 11:
But it fails randomly when using OpenJDK 8:
Thanks for the heads-up, I will ping this back to the team.
@SudharakaP thanks for confirming, and I’m sorry to hear that.
It sounds like a couple of Couchbase experts may be willing to lend a hand - hopefully there’s a straightforward solution so that we can make this module more reliable. Fingers crossed!
This continues to be a problem; our tests fail even if retried. This gives me some doubt that the module is actually stable enough for real use.
I think we’re going to have to reach out to Couchbase/the Couchbase community for help. In the worst-case scenario, if we can’t fix it, I’m inclined to remove this module.