controller-runtime: Allow a result that indicates the reconciliation is incomplete and does not trigger the exponential backoff logic

After discussing this with @detiber, we realized there’s no good solution for the following case:

A request is received
The request cannot proceed because of a missing dependency on resource A
The reconciler is watching events for resource A
The reconciler returns a result that indicates the reconciliation is incomplete due to reason X, but the request should not be requeued

A reconciler is using one or more watches to trigger requests
A request cannot proceed due to a missing dependency
If result.RequeueAfter > 0 then the request is added to the queue for processing after the value specified by result.RequeueAfter
If result.Requeue is true then the request is added to the queue with the same exponential backoff logic used when an error is returned
If an error is returned then the request is added to the queue with the exponential backoff logic

Today there is currently no way to indicate a reconciliation is incomplete without also having the request requeued by the manager either via an explicit amount of time or the exponential backoff logic (due to error or Requeue == true).

There should be a way to signal:

The reconciliation is incomplete
Do not requeue, the reconciler is watching the resources necessary to trigger its own events

Thanks!

About this issue

Original URL
State: open
Created 5 years ago
Reactions: 1
Comments: 29 (25 by maintainers)

Commits related to this issue

StorageCluster: don't consider pending uninstall as error In the reconciler, we considered a pending uninstall operation as an error. It resulted in slower reconciliation because of exponential backo... — committed to raghavendra-talur/ocs-operator by raghavendra-talur 4 years ago
StorageCluster: don't consider pending uninstall as error In the reconciler, we considered a pending uninstall operation as an error. It resulted in slower reconciliation because of exponential backo... — committed to raghavendra-talur/ocs-operator by raghavendra-talur 4 years ago
StorageCluster: don't consider pending uninstall as error In the reconciler, we considered a pending uninstall operation as an error. It resulted in slower reconciliation because of exponential backo... — committed to raghavendra-talur/ocs-operator by raghavendra-talur 4 years ago
StorageCluster: don't consider pending uninstall as error In the reconciler, we considered a pending uninstall operation as an error. It resulted in slower reconciliation because of exponential backo... — committed to raghavendra-talur/ocs-operator by raghavendra-talur 4 years ago
StorageCluster: don't consider pending uninstall as error In the reconciler, we considered a pending uninstall operation as an error. It resulted in slower reconciliation because of exponential backo... — committed to raghavendra-talur/ocs-operator by raghavendra-talur 4 years ago
StorageCluster: don't consider pending uninstall as error In the reconciler, we considered a pending uninstall operation as an error. It resulted in slower reconciliation because of exponential backo... — committed to raghavendra-talur/ocs-operator by raghavendra-talur 4 years ago
Various reconciliation fixes - Reduce log noise by logging errors instead of successes - Use context logger provided by controller-runtime - Patch status instead of update to avoid "the object has be... — committed to airshipit/vino by seaneagan 3 years ago

Most upvoted comments

Returning an error from a Reconciler seems to have limited value. It’s hard to know what to do in response that’s useful. I think we should consider changing the Reconciler interface so that it does not return an error at all.

When a Reconciler returns an error, three things happen (I’m ignoring metrics for the moment):

An error message gets logged
The request is requeued with backoff
reconcileHandler returns false, whose meaning is not documented, but causes the worker function to return and get called again after a jitter period.

(2) is hard to get right, as identified by this issue. It’s hard to know which errors deserve a retry. Isn’t it better to just let the Reconciler make its own decision, and set Requeue: true on the response when it determines that’s appropriate? It’s easy for a Reconciler author to make their own helper function that modifies or produces a response based on an error. That may be just as easy as giving them some way to pass a filter function or similar to controller-runtime that does the same.

(3) I’m not sure what the value is in this. What’s it accomplishing?

(1) also seems limited in value. Should all errors be logged the same way? Do all errors deserve to be logged at all? Generally no. We could instead expect the Reconciler to do its own error handling and log a useful message if/when/how appropriate. An optional helper function to generically log errors from within the Reconciler would be just as useful as the log behavior today.

Back to metrics, as it is today with a generic error count, it’s not clear what is being measured. I don’t think a broad error count, where an error could mean many different things, is actionable or particularly useful. A Reconciler implementation can capture its own metrics that are more meaningful.

Lastly to testing. As already observed, it’s hard to define what it means for a Reconciler run to be “incomplete”. Many Reconcilers are designed to make incremental progress and re-run many times while converging to desired state. Perhaps what is most useful to communicate in terms of “completeness” is whether the Reconcile logic is blocked from progressing toward desired state. I’m not sure if there’s a good generic way to capture that, or if that should be part of the Reconciler interface at all. Maybe that’s best handled and tested as an implementation detail behind the Reconciler interface.

In many cases when progress toward desired state is blocked, it’s useful to communicate that on the object’s Status. An example is when a required Secret is missing or has invalid credentials, because it’s a problem that the API user can fix. In these cases, the Status is a natural place for a test to determine success or failure.

In sum, when a Reconciler returns an error, it’s hard to know what to do with it. Rather than have an interface that lets a Reconciler pass us an error and something that helps us understand how to handle it, we could just let the Reconciler handle it.

mhrivnak on Feb 20, 2020

I came through this and #377 as well. @DirectXMan12 Some thoughts:

One approach:

type Result struct {
	// Requeue tells the Controller to requeue the reconcile key.  Defaults to false.
	Requeue bool
        // RequeueError tells the Controller to requeue the reconcile key when error is not nil.  If nil defaults to true.
	RequeueError *bool
	// RequeueAfter if greater than 0, tells the Controller to requeue the reconcile key after the Duration.
	// Implies that Requeue is true, there is no need to set Requeue to true at the same time as RequeueAfter.
	RequeueAfter time.Duration
}

if result, err := c.Do.Reconcile(req); err != nil {
	if requeueError(result.RequeueError) {
		c.Queue.AddRateLimited(req)
		log.Error(err, "Reconciler error", "controller", c.Name, "request", req)
		ctrlmetrics.ReconcileErrors.WithLabelValues(c.Name).Inc()
		ctrlmetrics.ReconcileTotal.WithLabelValues(c.Name, "error").Inc()
		return false
	}
	c.Queue.Forget(obj)
	return false
}

func requeueError(requeueError *bool) bool {
	if reconcileError == nil {
		return true
	}
	return *requeueError
}

Another approach:

if result, err := c.Do.Reconcile(req); err != nil {
	if requeueError(err) {
		c.Queue.AddRateLimited(req)
		log.Error(err, "Reconciler error", "controller", c.Name, "request", req)
		ctrlmetrics.ReconcileErrors.WithLabelValues(c.Name).Inc()
		ctrlmetrics.ReconcileTotal.WithLabelValues(c.Name, "error").Inc()
		return false
	}
	c.Queue.Forget(obj)
	return false
}

func requeueError(err error) bool {
	if apierrors.IsNotFound(err) {
		return false
	}
	if _, ok := err.(*client.IgnorableError); ok {
		return false
	}

	return true
}

// IgnorableError represents an error that should not be
// requeued for further processing.
type IgnorableError struct {}

// Error implements the error interface
func (e *IgnorableError) Error() string {
	return fmt.Sprintf("%s", e)
}

enxebre on Oct 2, 2019

To add a bit of additional context it is currently very difficult to test controller-runtime based reconciliation loops with the current behavior if there are cases where only partial reconciliation is expected due to external dependencies that trigger reconciliation based on watches.

Allowing for a case where Requeue is explicitly set to false and an error is returned would allow for easier testing of these cases.

detiber on Sep 30, 2019