spring-cloud-sleuth: Parent can only be null in a local root
Describe the bug
I am using Spring Cloud Sleuth 3.0.1-SNAPSHOT (as of 2020-12-31). The bug is really sporadic but does cause the worfkflow engine (flowable, which I have instrumented manually) to not call the next tasks in the BPMN workflow.
The error is this:
Exception in thread "orders-flowable-async-job-executor-thread-4" java.lang.AssertionError: Bug (or unexpected call to internal code): parent can only be null in a local root!
at brave.internal.recorder.PendingSpans.getOrCreate(PendingSpans.java:89)
at brave.Tracer._toSpan(Tracer.java:410)
at brave.Tracer.toSpan(Tracer.java:382)
at brave.Tracer.toSpan(Tracer.java:376)
at brave.LazySpan.span(LazySpan.java:141)
at brave.LazySpan.context(LazySpan.java:40)
at org.springframework.cloud.sleuth.brave.bridge.BraveSpan.context(BraveSpan.java:48)
at org.springframework.cloud.sleuth.brave.bridge.BraveTracer.nextSpan(BraveTracer.java:52)
at org.springframework.cloud.sleuth.instrument.async.TraceRunnable.run(TraceRunnable.java:61)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I think it is on this line in TraceRunable:
I am of the opinion that the TraceRunanble should not prevent the delegate Runnable in any way. We might benefit from a check before calling the nextSpan
method or catching this exception somehow.
Sample
I can’t reliably reproduce it, it is sporadic. We have on the order of 100 tests using flowable but I get one error every now and then.
Flowable can be set up with a custom implementation of a JobExectuor. I have it set up like so:
private AsyncExecutor initAsyncExecutor(String tennantId, final int maxConcurrent) {
final DefaultAsyncJobExecutor asyncExecutor = new DefaultAsyncJobExecutor();
asyncExecutor.setAutoActivate(false);
// other setup, irrelevant to the case
// customized default async job executor initialization with tracing info
// org.flowable.job.service.impl.asyncexecutor.DefaultAsyncJobExecutor.initAsyncJobExecutionThreadPool
asyncExecutor.setThreadPoolQueue(new ArrayBlockingQueue<>(queueSize));
asyncExecutor.setExecutorService(
new TraceableExecutorService(
this.applicationContext,
new ThreadPoolExecutor(
asyncExecutor.getCorePoolSize(),
asyncExecutor.getMaxPoolSize(),
asyncExecutor.getKeepAliveTime(), TimeUnit.MILLISECONDS,
asyncExecutor.getThreadPoolQueue(),
new BasicThreadFactory.Builder()
.namingPattern(tennantId + "-flowable-async-job-executor-thread-%d")
.build()
)
)
);
return asyncExecutor;
}
P.S. I am really grateful to you guys for the libraries you produce. They make my job easier!
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 25 (14 by maintainers)
FYI - I’ve submitted a fix for this to Brave in https://github.com/openzipkin/brave/pull/1306
In the meantime, here’s a hack for anyone else impacted to resolve it until that change can be merged and released:
You can file an issue in Brave and maybe link this one?
@FranPregernik To have a better understanding about what could have gone wrong and what Brave is doing under the hood, try to follow the stacktrace (here’s some help in case you need it). Let me help walking through the section that could help:
In the sample above, you use a
TraceableExecutorService
as anExecutorService
implementation. If you look into that class, you will see that it wraps yourRunnable
into aTraceRunnable
. SoTraceRunnable.run
is called becauseTraceableExecutorService
is used.As you move forward in the stacktrace, you can see that
BraveTracer.nextSpan
is called from TraceRunnable.run(TraceRunnable.java:61):This also shows where the
LazySpan
is coming from:this.parent = tracer.currentSpan();
.So if
tracer.currentSpan()
gives you the same instance ofSpan
for two differentTraceRunnable
instances, throughBraveTracer.nextSpan(BraveTracer.java:52)
,BraveSpan.context(BraveSpan.java:48)
, andLazySpan.context(LazySpan.java:40)
,LazySpan.span
can be called concurrently.Please let me know if this makes sense, also, since I’m not really able to repro this, this is still a guess.
What I mean is more or less the following: