controller-runtime: Allow a result that indicates the reconciliation is incomplete and does not trigger the exponential backoff logic
After discussing this with @detiber, we realized there’s no good solution for the following case:
- A request is received
- The request cannot proceed because of a missing dependency on resource A
- The reconciler is watching events for resource A
- The reconciler returns a result that indicates the reconciliation is incomplete due to reason X, but the request should not be requeued
Instead the current logic is:
- A reconciler is using one or more watches to trigger requests
- A request cannot proceed due to a missing dependency
- If
result.RequeueAfter > 0
then the request is added to the queue for processing after the value specified byresult.RequeueAfter
- If
result.Requeue
istrue
then the request is added to the queue with the same exponential backoff logic used when an error is returned - If an error is returned then the request is added to the queue with the exponential backoff logic
Today there is currently no way to indicate a reconciliation is incomplete without also having the request requeued by the manager either via an explicit amount of time or the exponential backoff logic (due to error or Requeue == true
).
There should be a way to signal:
- The reconciliation is incomplete
- Do not requeue, the reconciler is watching the resources necessary to trigger its own events
Thanks!
About this issue
- Original URL
- State: open
- Created 5 years ago
- Reactions: 1
- Comments: 29 (25 by maintainers)
Commits related to this issue
- StorageCluster: don't consider pending uninstall as error In the reconciler, we considered a pending uninstall operation as an error. It resulted in slower reconciliation because of exponential backo... — committed to raghavendra-talur/ocs-operator by raghavendra-talur 4 years ago
- StorageCluster: don't consider pending uninstall as error In the reconciler, we considered a pending uninstall operation as an error. It resulted in slower reconciliation because of exponential backo... — committed to raghavendra-talur/ocs-operator by raghavendra-talur 4 years ago
- StorageCluster: don't consider pending uninstall as error In the reconciler, we considered a pending uninstall operation as an error. It resulted in slower reconciliation because of exponential backo... — committed to raghavendra-talur/ocs-operator by raghavendra-talur 4 years ago
- StorageCluster: don't consider pending uninstall as error In the reconciler, we considered a pending uninstall operation as an error. It resulted in slower reconciliation because of exponential backo... — committed to raghavendra-talur/ocs-operator by raghavendra-talur 4 years ago
- StorageCluster: don't consider pending uninstall as error In the reconciler, we considered a pending uninstall operation as an error. It resulted in slower reconciliation because of exponential backo... — committed to raghavendra-talur/ocs-operator by raghavendra-talur 4 years ago
- StorageCluster: don't consider pending uninstall as error In the reconciler, we considered a pending uninstall operation as an error. It resulted in slower reconciliation because of exponential backo... — committed to raghavendra-talur/ocs-operator by raghavendra-talur 4 years ago
- Various reconciliation fixes - Reduce log noise by logging errors instead of successes - Use context logger provided by controller-runtime - Patch status instead of update to avoid "the object has be... — committed to airshipit/vino by seaneagan 3 years ago
Returning an error from a Reconciler seems to have limited value. It’s hard to know what to do in response that’s useful. I think we should consider changing the Reconciler interface so that it does not return an error at all.
When a Reconciler returns an error, three things happen (I’m ignoring metrics for the moment):
reconcileHandler
returnsfalse
, whose meaning is not documented, but causes the worker function to return and get called again after a jitter period.(2) is hard to get right, as identified by this issue. It’s hard to know which errors deserve a retry. Isn’t it better to just let the Reconciler make its own decision, and set
Requeue: true
on the response when it determines that’s appropriate? It’s easy for a Reconciler author to make their own helper function that modifies or produces a response based on an error. That may be just as easy as giving them some way to pass a filter function or similar to controller-runtime that does the same.(3) I’m not sure what the value is in this. What’s it accomplishing?
(1) also seems limited in value. Should all errors be logged the same way? Do all errors deserve to be logged at all? Generally no. We could instead expect the Reconciler to do its own error handling and log a useful message if/when/how appropriate. An optional helper function to generically log errors from within the Reconciler would be just as useful as the log behavior today.
Back to metrics, as it is today with a generic error count, it’s not clear what is being measured. I don’t think a broad error count, where an error could mean many different things, is actionable or particularly useful. A Reconciler implementation can capture its own metrics that are more meaningful.
Lastly to testing. As already observed, it’s hard to define what it means for a Reconciler run to be “incomplete”. Many Reconcilers are designed to make incremental progress and re-run many times while converging to desired state. Perhaps what is most useful to communicate in terms of “completeness” is whether the Reconcile logic is blocked from progressing toward desired state. I’m not sure if there’s a good generic way to capture that, or if that should be part of the Reconciler interface at all. Maybe that’s best handled and tested as an implementation detail behind the Reconciler interface.
In many cases when progress toward desired state is blocked, it’s useful to communicate that on the object’s Status. An example is when a required Secret is missing or has invalid credentials, because it’s a problem that the API user can fix. In these cases, the Status is a natural place for a test to determine success or failure.
In sum, when a Reconciler returns an error, it’s hard to know what to do with it. Rather than have an interface that lets a Reconciler pass us an error and something that helps us understand how to handle it, we could just let the Reconciler handle it.
I came through this and #377 as well. @DirectXMan12 Some thoughts:
One approach:
Another approach:
To add a bit of additional context it is currently very difficult to test controller-runtime based reconciliation loops with the current behavior if there are cases where only partial reconciliation is expected due to external dependencies that trigger reconciliation based on watches.
Allowing for a case where Requeue is explicitly set to
false
and an error is returned would allow for easier testing of these cases.