kotlinx.coroutines: CancellationException can be thrown not only to indicate cancellation

https://betterprogramming.pub/the-silent-killer-thats-crashing-your-coroutines-9171d1e8f79b raises a good point: CancellationException should indicate that the coroutine in which it is raised is cancelled, but it’s not always actually the case:

    fun testFoo() {
        val scope = CoroutineScope(Dispatchers.IO)
        val f = scope.async {
            delay(1000)
        }
        scope.cancel()
        var result = false
        runBlocking {
            launch {
                f.await() // throws CancellationException, silently stopping the `launch` coroutine
                result = true
            }.join()
        }
        check(result) { "issue" }
    }

Here, the programmer didn’t make the mistake of catching a CancellationException and rethrowing it somewhere else, but still, the CancellationException leaked from a different scope.

About this issue

Original URL
State: open
Created a year ago
Reactions: 15
Comments: 18 (9 by maintainers)

Commits related to this issue

Improve the explanation of how `await` throws exceptions Fixes two issues: * It is surprising for some users that the same exception can be thrown several times. Clarified this point explicitly. * ... — committed to Kotlin/kotlinx.coroutines by dkhalanskyjb 5 months ago
Improve the explanation of how `await` throws exceptions Fixes two issues: * It is surprising for some users that the same exception can be thrown several times. Clarified this point explicitly. * ... — committed to Kotlin/kotlinx.coroutines by dkhalanskyjb 5 months ago
Improve the explanation of how `await` throws exceptions Fixes two issues: * It is surprising for some users that the same exception can be thrown several times. Clarified this point explicitly. * ... — committed to Kotlin/kotlinx.coroutines by dkhalanskyjb 5 months ago

Most upvoted comments

If you are using deferred.await() in our code and you want to distinguish the case where the coroutine you’re awaiting for was cancelled or not, you already can do it by explicitly checking deferred.isCancelled. However, I don’t think the need to make this distinction happens very often.

IMO, the great takeaway from the original post on the “The Silent Killer” is this pattern:
try {
    doStuff()
} catch (error: Throwable) {
    coroutineContext.ensureActive()
    handleError(error)
}
That should be the recommended pattern to use instead of trying to catch CancellationException and ignoring it. That is a clear documentation thing, but we might also consider doing some utilities to help with it.

Please don’t make that the recommended pattern. The point of CancellationException is that it’s the only exception that safely bubbles up to the next checkpoint (be that launch or some try block with a custom subclass of CancellationException). This bubbling behavior is very important. The proposed pattern is bad because intermediate try-catch layers will believe there is an error instead of controlled backtracking. We use CancellationException as a signal for safe abortion in our codebase because that’s the only exception which gets actively ignored by intermediate code layers.

The proper solution is to fix the APIs where possible and introduce new fixed APIs where backwards compatibility is needed. The ensureActive() workaround is imprecise and not a true causal relationship.

wkornewald on Apr 28, 2023

If you are using deferred.await() in our code and you want to distinguish the case where the coroutine you’re awaiting for was cancelled or not, you already can do it by explicitly checking deferred.isCancelled. However, I don’t think the need to make this distinction happens very often.

IMO, the great takeaway from the original post on the “The Silent Killer” is this pattern:

try {
    doStuff()
} catch (error: Throwable) {
    coroutineContext.ensureActive()
    handleError(error)
}

That should be the recommended pattern to use instead of trying to catch CancellationException and ignoring it. That is a clear documentation thing, but we might also consider doing some utilities to help with it.

However, there is also one thing that can be fixed in the coroutines library itself in a backwards compatible way. The current state of affairs is that the coroutine that dies by a rogue CancellationException indeed does so completely silently, even though it is not cancelled itself. This can be clearly demonstrated by the following code:

suspend fun main() {
    val cancelledDeferred = CompletableDeferred<Int>()
    cancelledDeferred.cancel()
    supervisorScope {
        // This coroutine fails in a regular way
        launch(CoroutineName("one")) {
            error("Regular failure")
        }
        // This coroutine fails while waiting on cancelled coroutine
        launch(CoroutineName("two")) {
            cancelledDeferred.await()
        }
    }
}

Playground link

Neither coroutine “one” nor coroutine “two” is cancelled, both fail with exception, yet only the exception for the failure of the first coroutine is shown. We can consider changing this logic to report that the second coroutine has failed with CancelletationException, too.

elizarov on Mar 13, 2023

Overall, I agree with your points.

should be handled by the library if possible, even if getting there with backwards compatibility could be painful.

This is the main problem. Yes, this would be painful, to the point I personally would call hopeless. kotlinx-coroutines guarantees, for example, that if library A depends on B, where B only uses stable APIs, and C depends on kotlinx-coroutines, then upgrading the version of kotlinx-coroutines should not require recompiling the library B. So, even if we somehow magically manage to make all code automatically behave strictly better after it’s recompiled, there’s a whole world of binary compatibility headaches of the form “what if A sends a Deferred to B,” or “what if B returns a Deferred to A,” or “what if B calls cancel, but A checks the cancellation exception,” and so on, and so on. There’s a plethora of interactions we’d need to consider to even hope to do this correctly. By my estimates, this would amount to dropping everything else we’re doing and brainstorming just this task, and even then, there wouldn’t be a guarantee that we’d be able to evolve this (really ubiquitous) API.

If there ever is a kotlinx.coroutines 2.0 that is allowed to break compatibility freely (NB: I don’t mean indiscriminately), I’m certain this is an area we’d do differently, more in line with what you’re saying.

if this coroutine is cancelled, suppress the other exception or cancellation signal incoming and replace it with my cancellation signal.

I agree that the pattern proposed above is suboptimal. Here’s a better one:

try {
  // do anything
} catch (e: Throwable) {
  if (e is CancellationException) currentCoroutineContext().ensureActive()
  handleException(e)
}

The pattern without if (e is CancellationException) isn’t in line with the prompt cancellation guarantee that a lot of our functions give. In essence, prompt cancellation states that cancellation of the calling coroutine takes precedence over normal values, but not over exceptions. Without the if, the cancellation exception really does suppress the other exception, which is inconsistent and probably non-desirable.

dkhalanskyjb on Feb 14, 2024

I was debugging something like this for two days(in a production environment, had to add many log/counter and deploy and wait a reproducer, do that again for at least 20 times, VERY bad experience), even suspected at one point that it was a compiler BUG…

// simplified reproducer, the original is more complicated
// thus it's even harder to debug such kind of control-flow issue
suspend fun structuredConcurrency(toBeProcessedData:Collection<Any>){
    coroutineScope{
        for(data in toBeProcessedData)
            launch{handleData(data)}
    }
    println("all data was processed")
}

the all data was processed got printed so I assume that all data was processed(seriously, is this code looks anything wrong for anybody? but guess what, it’s not, and NO EXCEPTIONS IS LOGGED! it looks like a mysterious break was in the for loop, the coroutine just vanishes… no return, no exception, how is that possible…

well, this is because deep down in the handleData(data) callstack, someone introduced withTimeout… a reasonable change, but blow up the whole thing 😦

the point is, throwing CancellationException IS NOT THE SAME as cancelling a coroutine! ANYTHING CAN THROW ANY EXCEPTION, isn’t that the point of structured concurrency? if we call something, it has to EITHER RETURN OR THROW, not vanishing without a trace…

Anyway I diverge, I agree with @elizarov “We can consider changing this logic to report that the second coroutine has failed with CancelletationException, too”, and here’s my advice:

when cancelling a coroutine, the automatically produced CancellationException should have a private mark on it / or the cancelling coroutine should have a private mark of the current propagating CancellationException
if the marked CancellationException propagated, and is in the correct structured concurrency way, keep the current behavior
if any other exception propagated, or if the afore mentioned CancellationException is not propagating in the correct structured concurrency way(for example the case mentioned in the original post, or the user grab that reference and throwing them in another coroutine etc), treat them as error(throw outwards and don’t swallow them
please don’t complicate things by introducing Result or un-subclassing TimeoutCancellationException from CancellationException(backwards incompatible and very hard to find in tons of catch(e:CancellationException){...}

revintec on Jul 19, 2023

2 cents on the design, please destroy everything I say.

AFAIU CancellationException is just a signal to terminate a cancelled coroutine promptly without breaking the exceptions effect system (aka run the finally blocks). If there were no exceptions, then a coroutine could be terminated by simply suspending it without resuming it. However, currently kotlinx coroutines

uses CancellationException not only as a signal, but also as a result (the result of a cancelled deferred coroutine can be CancellationException).
doesn’t check for the origin of the CancellationException (even me manually throwing a CancellationException can result in a silent coroutine death)

fun main() = runBlocking {
    launch {
        throw CancellationException()
    }
    println("Hello, world!")
}

Kotlinx coroutines also encodes three possible states:

Finished(success)
Finished(failure)
Finished(cancelled)

with a single cause: Throwable? where the cancelled and failure states are unionized in the cause != null (cancelled=cause is CancellationException). In practice it makes sense, as the result of a couroutine can be technically an exception or a result, but IMHO it’s important to remember that we use an exception just because we are forced to do so just for the cancellation signal to not break the exception system. Once that the exception reaches the coroutine it should be unpacked to its actual meaning: is this the cancellation signal? then the coroutine state changes to cancelled. Is this not the cancellation signal? The the coroutine state changes to completed exceptionally. Again, if there were no exceptions and we implemented cancellation by just suspending the coroutine, we would likely have the cancelled state as a seperate one instead of encoding it as cause is CancellationException.

If the above makes sense, then kotlinx coroutines could:

not use CancellationException as the result of a deferred coroutine. For example, if the deferred coroutine is cancelled, then await could throw a CancelledCoroutineException (a failure).
add a way to identity the correct cancellation signals, for example, adding a token to each CancellationException which allows the coroutine completed with it to verify the token and update the state in response.

In an ideal world, CancellationException could be an internal implementation detail of kotlinx coroutines which is used to implement the signal transparently. In practice, developers catching Throwable will always also catch the CancellationException, so they have to rethrow it.

Instead of

interface Job {
    public fun cancel(cause: CancellationException? = null)
}

we could have

internal class CancellationException: Throwable

interface Job {
    public fun cancel(message: String? = null, cause: Throwable? = null)
}

fun Throwable.rethrowIfCancellationSignal() {
  if (this is CancellationException) { throw this }
}

The state of the coroutine conceptually would be:

sealed class State {
    data class Completed(val value: Any?): State()
    data class Failed(val cause: Throwable): State()
    data class Cancelled(val message: String, val cause: Throwable?): State()
}

and Deferred.await would do

fun await() = when(state) {
    is State.Cancelled -> throw CoroutineCancelledException(state.message, state.cause)
    is State.Completed -> state.value
    is State.Failed -> throw state.cause
}

This is just to say that IMHO this should not be “solved” by the developer using the coroutineContext.ensureActive() pattern, which is just a way of condensing the following logic:

if this coroutine is cancelled, suppress the other exception or cancellation signal incoming and replace it with my cancellation signal.
if this coroutine is not cancelled, then the exception is either a failure or a cancellation signal of another coroutine (how did it get there though!), handle them separately.

This is not a great pattern IMHO, and should be handled by the library if possible, even if getting there with backwards compatibility could be painful.

francescotescari on Feb 14, 2024

@throwable-one, it seems your problem is not with how exceptions are handled in the library overall but with the specific behavior of await. Here’s one way to implement the function you want:

@OptIn(ExperimentalCoroutinesApi::class)
suspend fun <T> Deferred<T>.newAwait(): AwaitResult<T> {
    join()
    return when (val exception = getCompletionExceptionOrNull()) {
        null -> AwaitResult.Ok(getCompleted())
        is CancellationException -> AwaitResult.Stopped(exception)
        else -> throw exception
    }
}

dkhalanskyjb on Nov 29, 2023

Just a Value (with the implicit Throwable that can surface at any point in a JVM program) is even more minimal.

dkhalanskyjb on Mar 9, 2023

Well, this is as minimal as Value | Throwable union can be at the moment

dovchinnikov on Mar 9, 2023

This is a robust solution, but non-Kotlin-style. Kotlin tries to keep hierarchies of return types as thin as possible to save the users the boilerplate of unpacking the results. For example, Map<Int, String?>::get doesn’t return an Optional<String?>, it just returns String?.

dkhalanskyjb on Mar 9, 2023

The above could be implemented as an extension function:

suspend fun Deferred<T>.awaitResult() : Result<T> {
  return try {
    Result.success(await())
  }
  catch (ce: CancellationException) {
    // this can "override" the original CE, but the current coroutine cancellation is preferred, so it's okay
    currentCoroutineContext().job.ensureActive()
    Result.failure(t)
  }
  catch (t: Throwable) {
    // resume awaitResult with the original throwable result to give a chance to handle it
    // even if the current coroutine is cancelled concurrently
    Result.failure(t)
  }
}

dovchinnikov on Mar 9, 2023

await (probably another function since this one is taken) should return kotlin.Result

if await resumes with CE, then it means the current coroutine was cancelled;
if await resumes with Result, then async was completed before the current coroutine was cancelled, and it represents a value/CE/failure of the async.

dovchinnikov on Mar 9, 2023

Need to dive deep into the implementation to see if this is possible, but it looks like modifying Deferred to throw a non-CancellationException may not be a good solution.

Consider this simplified scenario:

runBlocking {
  val result = async(Dispatchers.Default) {
    delay(5.seconds)
  }
  launch(Dispatchers.Unconfined) {
    result.await()
  }
  delay(1.seconds)
  cancel() // cancel the common scope
}

Here, we possibly have a race: the global scope is cancelled, which causes the coroutines in async and launch also to be cancelled. However, it can happen that this sequence of events happens before launch gets cancelled:

async gets cancelled,
the Deferred learns about the cancellation exception,
await finishes with a non-CancellationException in the launch,
which causes the launch to fail with a non-CancellationException,
with which the entire runBlocking fails!

So, the code can fail with one exception or rarely with another. This is just one simplified example of a broader class of races that we could introduce.

dkhalanskyjb on Mar 9, 2023