kotlinx.coroutines: Breaking: Rethink atomicity of certain low-level primitives
A number of primitives in the library provide “atomic cancellation”, among them Channel.send, Channel.receive, Mutex.lock, and Semaphore.acquire.
As their documentation states:
Cancellation of suspended invocation is atomic: when this function throws CancellationException it means that the operation was not performed. As a side-effect of atomic cancellation, a thread-bound coroutine (to some UI thread, for example) may continue to execute even after it was canceled from the same thread in the case when this operation was already resumed and the continuation was posted for execution to the thread’s queue.
This is a surprising behavior that needs to be accounted for when writing higher-level primitives like Flow. See #1265 as an example problem and discussion in #1730.
It all points to the need to rethink the atomic cancellation behavior of the underlying primitives.
More background on the atomic cancellation
The concept of atomic cancellation was introduced early in the design of kotlinx.coroutines, before the 1.0 release. Originally it was used only in channels, with the goal of enabling safe ownership transfer of closeable resources between different coroutines.
Hereinafter we use the word resource to denote a reference to any object that must be explicitly closed. Usually, but not always, it encapsulates some native resource that is not managed by garbage collection. Losing a reference to such a resource creates a leak.
For example, consider a coroutine that opens a file and gets an InputStream. InputStream is a resource that must be explicitly closed after use. Now, if the coroutine needs to send it to another coroutine for processing via val channel = Channel<InputStream>(), then it has the following code:
val input = file.inputStream() // opens an input stream to a file
channel.send(input) // send for processing
The other coroutine receives the input stream and works with it, closing it afterward:
val input = channel.receive()
input.use { process(it) } // closes at the end
If send and receive were simply cancellable, as most other suspending functions created with the help of suspendCancellableCoroutine are, then the sender would have no way of knowing if the send had actually completed successfully or not. It could be canceled before completing send or it could send and then get canceled. The receiver can get canceled after receiving it, too. There would be no way to ensure safe ownership transfer from one coroutine to another. As a solution to this problem the concept of “atomic cancellation” was born. In kotlinx.coroutines up until now, the above code is given the following guarantees:
- Atomic send cancellation: if send completes successfully then the reference was put into the channel, otherwise it was not.
- Atomic receive cancellation: if receive completes successfully then the reference was retrieved from the channel, otherwise it was not.
Together, these two guarantees support the leak-free transfer of ownership between coroutines when the code on the send side is modified in a specific way, like this:
val input = file.inputStream() // opens file
try {
    channel.send(input)
} catch (e: Throwable) {
    input.close() // close input on any exception, we know it was not delivered
    throw e // rethrow
}
Public API status of atomic cancellation
The “atomic cancellation” was never widely publicized. It was briefly documented by a single paragraph in both send and receive documentation, but it was not mentioned in any of the guides concerning coroutines. Moreover, as it was shown above, the atomic cancellation itself only enables safe ownership transfer. One still has to write the code with extreme caution to ensure that resources are transferred safely between coroutines and the corresponding code one has to write was never explained in the documentation before.
The atomic cancellation was only employed by a small number of kotlinx.coroutines functions and it was not possible to write user-defined primitives with similar atomic-cancellation behavior. The corresponding low-level mechanisms are internal in the library.
Problems with atomic cancellation
Atomic cancellation creates a major hurdle when using channels in UI code. In UI applications, cancellation is used to manage the lifecycle of UI elements. For example, consider a coroutine that is running in the scope of some UI view:
val data = dataChannel.receive() // receive some data
updateUI(data)
When the data is received from the channel, the line with updateUI(data) is not executed immediately. The corresponding continuation is scheduled for execution on the main thread and needs to wait until the main thread is available. However, while this continuation waits in the queue, the corresponding UI view might get destroyed. Normally, with all the other cancellable suspending functions that use suspendCancellableCoroutine, when the main thread is finally ready to execute the continuation, the cancellation state of the coroutine is checked and CancellationException is thrown instead of executing updateUI(data). However, this check for cancellation is not performed for the Channel.receive continuation because there is an “atomic cancellation” guarantee for receive. The data element was already received and must be delivered, so updateUI(data) will get executed despite the fact that the coroutine is already canceled. On Android, in particular, an attempt to update a UI view that was already destroyed would lead to an exception.
In practice, it means that every time a Channel (or another atomically-cancellable primitive) is used in the main thread, one must not forget to manually add the check for cancellation of the current coroutine:
val data = dataChannel.receive() // receive some data
ensureActive() // check for cancellation manually
updateUI(data)
Forgetting to add this check leads to a hard-to-find bug that manifests itself only rarely.
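A minimal, self-contained sketch of the manual check follows. It is an illustration under stated assumptions rather than real UI code: the function name receiveThenCheck is hypothetical, the assignment to shown stands in for updateUI(data), and the explicit cancel() call stands in for the view being destroyed between the receive and the UI update.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel

// Hypothetical demo: returns the value that reached the "UI", or null if none did.
fun receiveThenCheck(): String? = runBlocking {
    val dataChannel = Channel<String>(Channel.BUFFERED)
    dataChannel.send("payload")      // buffered, so this does not suspend

    var shown: String? = null        // stands in for the state changed by updateUI(data)
    val job = launch {
        val data = dataChannel.receive() // receive some data
        cancel()                     // assumption: simulates the scope being canceled right after the receive
        ensureActive()               // check for cancellation manually; throws CancellationException here
        shown = data                 // the "UI update"; not reached in this sketch
    }
    job.join()
    shown
}
```

Calling receiveThenCheck() returns null, because ensureActive() stops the coroutine before the stand-in UI update runs.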
This design creates irregularities in the coroutines API surface. Suspending functions in kotlinx.coroutines are normally cancellable in a regular fashion, and it is guaranteed that when they resume successfully the corresponding coroutine was not canceled. However, there are a few exceptional functions with “atomic cancellation” behavior that one must remember by heart. There is no consistent naming to make them stand apart. They look like all the other suspending functions and they are cancellable, too, but they can resume successfully when the coroutine was already canceled. Tricky and error-prone.
Moreover, atomic cancellation does not make resource transfer easy. Even with atomic cancellation, it is still a tricky and error-prone endeavor. In addition to the intricate code dance one has to perform (as shown above), there is a problem with channel cancellation. When the receiver on the channel does not plan to continue receiving data and cancels the whole channel to indicate that, all references that were stored in the channel buffer are simply dropped. It means that channel cancellation cannot be used when a channel is used to transfer resources. You can work around it when you use a channel directly (avoid cancellation, manually retrieve and close all resources from the channel you no longer need), but this makes it impossible to design resource-leak-free high-level primitives like Kotlin Flow that must rely on channel cancellation for their own needs.
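The manual workaround mentioned above (retrieve and close everything left in a channel instead of canceling it) might look roughly like the following sketch. The names CountingResource, drainAndClose, and drainDemo are hypothetical; tryReceive is the non-suspending receive found in recent kotlinx.coroutines releases. Note that this only closes what is available at the moment of the call and does nothing about elements sent later.

```kotlin
import java.io.Closeable
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.channels.ReceiveChannel

// Hypothetical resource used only for this illustration.
class CountingResource : Closeable {
    var closed = false
        private set

    override fun close() {
        closed = true
    }
}

// Hypothetical helper: closes every element that can be taken from the channel
// right now without suspending, and returns how many were closed.
fun drainAndClose(channel: ReceiveChannel<Closeable>): Int {
    var count = 0
    while (true) {
        val resource = channel.tryReceive().getOrNull() ?: break
        resource.close()
        count++
    }
    return count
}

// Puts two resources into a buffered channel, then drains it by hand.
fun drainDemo(): Pair<Int, Boolean> {
    val channel = Channel<Closeable>(capacity = 4)
    val resources = listOf(CountingResource(), CountingResource())
    resources.forEach { channel.trySend(it) }
    val closedCount = drainAndClose(channel)
    return closedCount to resources.all { it.closed }
}
```

The amount of bookkeeping even in this small case illustrates why the approach does not scale to higher-level primitives that cancel channels internally.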
All in all, atomic cancellation fails to deliver on the promise of being a safe solution for resource transfer from one coroutine to another.
Additional details on Mutex and Semaphore
The atomic cancellation of Mutex.lock and Semaphore.acquire brings none of the advantages it has for channels, only the problems. The typical pattern of mutex/semaphore use is:
mutex.lock()
try {
doSomething()
} finally {
mutex.unlock()
}
The correctness of this pattern does not require that lock is atomically cancellable in the same way as send. All it needs is that a canceled lock attempt does not keep the mutex locked. CancellableContinuation already provides the resume(value) { doOnCancel() } API to enable a correct lock implementation. See the documentation for CancellableContinuation.resume.
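The requirement that a canceled lock attempt does not leave the mutex locked can be observed with a small sketch. The function name canceledLockAttempt is hypothetical; withLock is the library's shorthand for the lock/try/finally/unlock pattern shown above.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

// Hypothetical demo: returns (critical section entered, mutex still locked).
fun canceledLockAttempt(): Pair<Boolean, Boolean> = runBlocking {
    val mutex = Mutex()
    mutex.lock()                       // held by the outer coroutine

    var entered = false
    // UNDISPATCHED makes the waiter run immediately until it suspends inside lock().
    val waiter = launch(start = CoroutineStart.UNDISPATCHED) {
        mutex.withLock { entered = true }
    }

    waiter.cancelAndJoin()             // cancel the suspended lock attempt
    mutex.unlock()                     // release the only real holder

    entered to mutex.isLocked
}
```

In this sketch the result is (false, false): the canceled waiter never enters the critical section, and after the holder unlocks, the mutex is free.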
Proposed solution
The proposal is to completely drop the concept of “atomic cancellation” from kotlinx.coroutines and make all suspending functions in the library consistently cancellable in a regular way, with a guarantee that a suspending function does not resume normally in a canceled coroutine.
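The guarantee can be illustrated with a sketch in which an element is handed to a suspended receiver and the receiver is canceled before its continuation gets to run. The function name receiveInCanceledCoroutine is hypothetical, and the outcome described below assumes a library version that implements this proposal; under atomic cancellation the same program would observe a normal resume.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel

// Hypothetical demo: reports whether receive() resumed normally or threw.
fun receiveInCanceledCoroutine(): String = runBlocking {
    val channel = Channel<Int>()       // rendezvous channel
    var outcome = "not finished"

    val consumer = launch {
        try {
            channel.receive()
            outcome = "resumed"
        } catch (e: CancellationException) {
            outcome = "canceled"
            throw e                    // keep cancellation semantics intact
        }
    }

    yield()                // let the consumer start and suspend in receive()
    channel.trySend(42)    // hands the element over; the consumer's continuation is now queued
    consumer.cancel()      // cancel before the queued continuation runs
    consumer.join()
    outcome
}
```

With regular cancellation the function returns "canceled". The element 42 is lost in this case, which is why a separate mechanism for transferring resources is needed.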
In order to address the rare use-case of actually sending resources between coroutines via channels a different easy-to-use mechanism will be introduced together with this change. See #1936 for a concrete proposal.
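For reference, recent kotlinx.coroutines releases let the Channel factory function take an onUndeliveredElement callback, which is invoked for elements that were sent but will never be received. Whether that matches the proposal in #1936 in every detail is not established by this text, so the following is only a sketch; TrackedResource and transferWithCleanup are hypothetical names.

```kotlin
import java.io.Closeable
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.runBlocking

// Hypothetical resource used only for this illustration.
class TrackedResource(val name: String) : Closeable {
    var closed = false
        private set

    override fun close() {
        closed = true
    }
}

// Sends two resources, receives one, then cancels the channel with one still buffered.
fun transferWithCleanup(): List<TrackedResource> = runBlocking {
    val first = TrackedResource("first")
    val second = TrackedResource("second")

    val channel = Channel<TrackedResource>(
        capacity = 2,
        onUndeliveredElement = { it.close() } // closes anything that never reaches a receiver
    )
    channel.send(first)
    channel.send(second)

    // The receiver owns what it receives and closes it after use.
    channel.receive().use { /* process the resource here */ }

    // Canceling the channel drops the buffered element; the callback closes it.
    channel.cancel()

    listOf(first, second)
}
```

Both resources end up closed: the first by the receiver, the second by the callback when the channel is canceled.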
The key rationale for this particular and radical way to address the problem is that atomic cancellation was underdocumented and very error-prone to start with. Thus, it is extremely unlikely that there is a significant amount of code in the wild that correctly exploits those atomic cancellation guarantees to perform a leak-free transfer of resources between coroutines. At the same time, changing the behavior to regular cancellation will likely fix a lot of hard-to-find cancellation bugs in the existing Android applications that use coroutines.
Impact of the change
This is a breaking change. Some code might have been relying on the atomic cancellation guarantees. After this change, such code might encounter two kinds of errors:
- Receiving CancellationException from channel.send, the code might assume that the item was not sent to the channel and will go on to close the corresponding resource. However, without the atomic cancellation guarantee on send, the resource might actually have been delivered to the receiver.
- The receive call for a resource that was previously successfully sent to the channel might get CancellationException without the atomic cancellation guarantee on channel.receive. Thus, the reference is lost and the resource leaks.
Whether this is an acceptable behavior change for a major release is TBD. The key question is whether there are any widely used libraries that could be seriously impacted by this change. We plan to publish at least one -Mx (milestone) release with the proposed change so that its impact can be evaluated while running actual code with the updated kotlinx.coroutines.
Possible alternatives
We can try to maintain a certain degree of backward compatibility while making this change:
Option 1: Maintain binary backward compatibility. It means that we’ll retain the old behavior (with atomic cancellation) for previously compiled code, but any code that gets compiled against the new version of kotlinx.coroutines will get the new behavior (without atomic cancellation).
- Cons: It will vastly complicate the code and this kind of backward compatibility guarantee will be very hard to test and to support in the future.
Option 2: Maintain source backward compatibility. It means that old send and receive will behave as before (with atomic cancellation) and some new methods to perform regularly cancellable send and receive are introduced. It can be done by either having a dedicated Channel constructor to customize cancellation behavior or dedicated send/receive methods with new names.
- Cons: It will make the public API surface of kotlinx.coroutines more complex. What’s worse, as was shown in the preceding overview, the whole atomic cancellation behavior is hard to use and error-prone anyway. It does not deserve to be maintained in the long run, so the corresponding “old” methods that perform atomically-cancellable send/receive will have to be deprecated, meaning a transition to less natural names.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 17
- Comments: 18 (13 by maintainers)
Commits related to this issue
- Remove atomic cancellation support This is a problematic for Android when Main dispatcher is cancelled on destroyed activity. Atomic nature of channels is designed to prevent loss of elements, which ... — committed to Kotlin/kotlinx.coroutines by elizarov 5 years ago
- Breaking: Get rid of atomic cancellation and provide a replacement (#1937) This is a problematic for Android when Main dispatcher is cancelled on destroyed activity. Atomic nature of channels is des... — committed to Kotlin/kotlinx.coroutines by elizarov 4 years ago
- Breaking: Get rid of atomic cancellation and provide a replacement (#1937) This is a problematic for Android when Main dispatcher is cancelled on destroyed activity. Atomic nature of channels is des... — committed to recheej/kotlinx.coroutines by elizarov 4 years ago
- GuaranteedEffect - Wrapper for a side effect that can guarantee that it handled exactly once. Together with `Store` it can also guarantee delivery of the side effect. Also see: [Proposal] Primitive o... — committed to fluxo-kt/fluxo by amal 2 years ago
@elizarov: Let’s consider, for example, an Android app that receives push notifications and wants to display them as a dialog inside the app. This can be done by creating a Channel where all these notifications are sent, and having every Activity launch a coroutine in its foreground scope to receive them from it and show them as dialogs to the user.
Now if a new Activity is launched on top of a previous one, then the previous Activity will be moved to the background, and the coroutine that it had launched to receive notifications from the Channel will be cancelled, while the new Activity will launch a new coroutine of its own to start receiving notifications. If there were any notifications in transit to the previous Activity when this happened, then they will be discarded and lost, and not delivered to the new Activity.
To keep the case simple, I left out a fallback for handling the notifications when the app is in the background, which in most cases would be implemented by having them shown in Android’s notification drawer. This could be achieved by adding a coroutine to receive and handle the notifications when the app is in the background, which would encounter the same issues mentioned before. I didn’t include this in the main scenario, because there could also be other ways to implement a fallback, such as by using offer, but that would still have issues in the described scenario and might also have other issues.
Good point, added.
Update: A different, simpler, and more convenient replacement for safe transfer of resources via channels will be introduced. See #1936.
@pakoito The pitfalls of atomic cancellation can be demonstrated with this code:
Try it in Kotlin Playground: https://pl.kotl.in/x5Bi-IiAO and see that it gets:
That is, the failing line 12 is check(isActive). It is very counterintuitive behavior that a canceled coroutine had resumed normally. It is all the more problematic because it affects only channels. If you do something like delay(100), then you will not experience such a problem.
With the proposed change the same code completes without errors because line 10 with val ok = channel.receive() would throw a CancellationException, preventing further execution of the canceled consumer coroutine.