spring-cloud-sleuth: Parent can only be null in a local root

Describe the bug

I am using Spring Cloud Sleuth 3.0.1-SNAPSHOT (as of 2020-12-31). The bug is really sporadic but does cause the worfkflow engine (flowable, which I have instrumented manually) to not call the next tasks in the BPMN workflow.

The error is this:

Exception in thread "orders-flowable-async-job-executor-thread-4" java.lang.AssertionError: Bug (or unexpected call to internal code): parent can only be null in a local root!
	at brave.internal.recorder.PendingSpans.getOrCreate(PendingSpans.java:89)
	at brave.Tracer._toSpan(Tracer.java:410)
	at brave.Tracer.toSpan(Tracer.java:382)
	at brave.Tracer.toSpan(Tracer.java:376)
	at brave.LazySpan.span(LazySpan.java:141)
	at brave.LazySpan.context(LazySpan.java:40)
	at org.springframework.cloud.sleuth.brave.bridge.BraveSpan.context(BraveSpan.java:48)
	at org.springframework.cloud.sleuth.brave.bridge.BraveTracer.nextSpan(BraveTracer.java:52)
	at org.springframework.cloud.sleuth.instrument.async.TraceRunnable.run(TraceRunnable.java:61)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I think it is on this line in TraceRunable: image

I am of the opinion that the TraceRunanble should not prevent the delegate Runnable in any way. We might benefit from a check before calling the nextSpan method or catching this exception somehow.

Sample

I can’t reliably reproduce it, it is sporadic. We have on the order of 100 tests using flowable but I get one error every now and then.

Flowable can be set up with a custom implementation of a JobExectuor. I have it set up like so:

 private AsyncExecutor initAsyncExecutor(String tennantId, final int maxConcurrent) {
        final DefaultAsyncJobExecutor asyncExecutor = new DefaultAsyncJobExecutor();
        asyncExecutor.setAutoActivate(false);

        // other setup, irrelevant to the case

        // customized default async job executor initialization with tracing info
        // org.flowable.job.service.impl.asyncexecutor.DefaultAsyncJobExecutor.initAsyncJobExecutionThreadPool
        asyncExecutor.setThreadPoolQueue(new ArrayBlockingQueue<>(queueSize));
        asyncExecutor.setExecutorService(
            new TraceableExecutorService(
                this.applicationContext,
                new ThreadPoolExecutor(
                    asyncExecutor.getCorePoolSize(),
                    asyncExecutor.getMaxPoolSize(),
                    asyncExecutor.getKeepAliveTime(), TimeUnit.MILLISECONDS,
                    asyncExecutor.getThreadPoolQueue(),
                    new BasicThreadFactory.Builder()
                        .namingPattern(tennantId + "-flowable-async-job-executor-thread-%d")
                        .build()
                )
            )
        );

        return asyncExecutor;
    }

P.S. I am really grateful to you guys for the libraries you produce. They make my job easier!

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 25 (14 by maintainers)

Most upvoted comments

FYI - I’ve submitted a fix for this to Brave in https://github.com/openzipkin/brave/pull/1306

In the meantime, here’s a hack for anyone else impacted to resolve it until that change can be merged and released:

@Aspect
class BraveTracerNextSpanRetryAspect {

    /*
     * Stupid hack to retry BraveTracer::nextSpan(span) because of a race condition wherein a span being loaded concurrent with the
     * span being flushed will throw an AssertionError. Since the flush will have completed, retrying succeeds.
     */
    @Around("execution(org.springframework.cloud.sleuth.Span org.springframework.cloud.sleuth.brave.bridge.BraveTracer.nextSpan(..))")
    public Object wrapNextSpan(ProceedingJoinPoint pjp) throws Throwable {
        try {
            return pjp.proceed();
        } catch (AssertionError e) {
            return pjp.proceed();
        }
    }
}

You can file an issue in Brave and maybe link this one?

@FranPregernik To have a better understanding about what could have gone wrong and what Brave is doing under the hood, try to follow the stacktrace (here’s some help in case you need it). Let me help walking through the section that could help:

	at brave.LazySpan.span(LazySpan.java:141)
	at brave.LazySpan.context(LazySpan.java:40)
	at org.springframework.cloud.sleuth.brave.bridge.BraveSpan.context(BraveSpan.java:48)
	at org.springframework.cloud.sleuth.brave.bridge.BraveTracer.nextSpan(BraveTracer.java:52)
	at org.springframework.cloud.sleuth.instrument.async.TraceRunnable.run(TraceRunnable.java:61)

In the sample above, you use a TraceableExecutorService as an ExecutorService implementation. If you look into that class, you will see that it wraps your Runnable into a TraceRunnable. So TraceRunnable.run is called because TraceableExecutorService is used.

As you move forward in the stacktrace, you can see that BraveTracer.nextSpan is called from TraceRunnable.run(TraceRunnable.java:61):

Span childSpan = this.tracer.nextSpan(this.parent).name(this.spanName);

This also shows where the LazySpan is coming from: this.parent = tracer.currentSpan();.
So if tracer.currentSpan() gives you the same instance of Span for two different TraceRunnable instances, through BraveTracer.nextSpan(BraveTracer.java:52), BraveSpan.context(BraveSpan.java:48), and LazySpan.context(LazySpan.java:40), LazySpan.span can be called concurrently.

Please let me know if this makes sense, also, since I’m not really able to repro this, this is still a guess.

What I mean is more or less the following:

Span childSpan = this.parent != null ? this.tracer.nextSpan(this.parentSpan).name(this.spanName) : this.tracer.nextSpan().name(this.spanName);