kubernetes: WATCH from RV "" or "0" breaks watch order invariant

What happened: This stems from discussions in https://github.com/kubernetes/kubernetes/pull/67350#discussion_r256083071

WATCH calls starting at RV "" or "0" are fundamentally broken: they produce events that are not ordered by resource version. This means that when the WATCH is interrupted during this initial phase and the client restarts from the last resource version it saw, it can either receive some events twice or miss some. The client also can’t detect this initial phase, given it’s part of the continuous stream.
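To make the failure concrete, here is a minimal sketch of the common restart-from-last-RV loop that this breaks (the `watcher` interface is hypothetical, standing in for a client-go call such as `clientset.CoreV1().Pods(ns).Watch`; this is not the real reflector code):

```go
package main

import (
	"context"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
)

// watcher stands in for a real client call; it exists only for this sketch.
type watcher interface {
	Watch(ctx context.Context, opts metav1.ListOptions) (watch.Interface, error)
}

// watchLoop is the common restart-from-last-seen-RV pattern. It is only
// safe if events arrive in increasing RV order: if the RV="0" phase sends
// e.g. RVs 5, 3, 7 and the stream drops after the client recorded 3,
// restarting at RV=3 replays event 5 that was already delivered.
func watchLoop(ctx context.Context, w watcher, initialRV string, handle func(watch.Event)) error {
	lastRV := initialRV
	for {
		stream, err := w.Watch(ctx, metav1.ListOptions{ResourceVersion: lastRV})
		if err != nil {
			return err
		}
		for ev := range stream.ResultChan() {
			if obj, err := meta.Accessor(ev.Object); err == nil {
				lastRV = obj.GetResourceVersion()
			}
			handle(ev)
		}
		// Stream closed: restart from lastRV, which assumes monotonic RVs.
	}
}
```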

This happens in the watch cache: https://github.com/kubernetes/kubernetes/blob/2cbb16bc8dd456c5db72c1667926abdbc87c32c7/staging/src/k8s.io/apiserver/pkg/storage/cacher/watch_cache.go#L434-L461 (the watch cache also takes the last item as the last RV, which seems wrong in that case: https://github.com/kubernetes/kubernetes/blob/b862590bfcd40d5f82f2a8e1d5d3d7b147b82d55/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L370)

and in a direct etcd read: https://github.com/kubernetes/kubernetes/blob/2cbb16bc8dd456c5db72c1667926abdbc87c32c7/staging/src/k8s.io/apiserver/pkg/storage/etcd3/watcher.go#L172-L186

We need to sort those events by RV before putting them on the watch stream.
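A sketch of the shape of that fix (illustrative helper names; the real change would live in the watch cache and etcd3 watcher linked above, and parsing RVs as integers is normally an apiserver-internal detail):

```go
package main

import (
	"sort"
	"strconv"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/watch"
)

// rvOf parses an object's resourceVersion as an integer etcd revision.
// Done here only to illustrate the ordering fix.
func rvOf(ev watch.Event) uint64 {
	obj, err := meta.Accessor(ev.Object)
	if err != nil {
		return 0
	}
	rv, _ := strconv.ParseUint(obj.GetResourceVersion(), 10, 64)
	return rv
}

// sortInitialEvents orders the buffered "initial state" events by RV
// before they are written to the watch stream, so the stream satisfies
// the monotonic-RV invariant.
func sortInitialEvents(events []watch.Event) {
	sort.Slice(events, func(i, j int) bool {
		return rvOf(events[i]) < rvOf(events[j])
	})
}
```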

What you expected to happen: a WATCH call always returns events with monotonically increasing RVs.
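For illustration, a stream-side check of that invariant could look like this (a sketch only; it assumes RVs parse as etcd revisions, which holds today even though clients are supposed to treat RVs as opaque):

```go
package main

import (
	"fmt"
	"strconv"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/watch"
)

// checkMonotonicRV consumes a watch stream and fails on the first event
// whose RV is not strictly greater than its predecessor's.
func checkMonotonicRV(stream watch.Interface) error {
	var last uint64
	for ev := range stream.ResultChan() {
		obj, err := meta.Accessor(ev.Object)
		if err != nil {
			continue // e.g. a Status object on an error event
		}
		rv, err := strconv.ParseUint(obj.GetResourceVersion(), 10, 64)
		if err != nil {
			continue
		}
		if rv <= last {
			return fmt.Errorf("watch order invariant broken: RV %d after %d", rv, last)
		}
		last = rv
	}
	return nil
}
```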

There is also a v2 proposal for changing the behaviour, but that seems further out than fixing the existing API.

/kind bug
/priority critical-urgent

/cc @liggitt @wojtek-t @kubernetes/sig-api-machinery-bugs


Most upvoted comments

Yeah, I don’t mean to be difficult, but the RV=0 case was literally intended to offer the semantics of a list, then turn into a watch; folks doing this were not expected to want to restart, since they can’t even know whether they ever got a complete state or not. It’s “axiomatic breakdown” vs “historical reasons”. I appreciate the appeal of the axiomatic breakdown, but it’s not connecting with me because it’s not how the feature grew in the first place.

Restartability is an optimization. Personally, I now think it’s a bug that we reused the RV from the objects as the restart token rather than sending that in a side-channel or something, but that can’t be helped now.

Anyway, to be productive, let me restate my position. I don’t mind sorting, provided we have evidence that it is useful (that @wojtek-t’s example case is rare) and not harmful (the cost of the sort is measured and is low). Since clients can fix this at any time by doing a list, then a watch (which we recommend anyway), I don’t consider this a high-priority thing to fix.

I think the bookmark-based fix is easier to implement and roll out than gathering the above evidence would be. I therefore think we should just document the behavior, and either deprecate it now or fix it when bookmarks have rolled out.
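For context, a sketch of why bookmarks help with restartability: a BOOKMARK event’s only meaningful payload is its resourceVersion, so the client can keep its restart token fresh without receiving (or re-receiving) object events. This mirrors how reflectors consume bookmarks; it is not the actual kubernetes code:

```go
package main

import (
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/watch"
)

// consumeWithBookmarks advances the restart token on every event,
// including BOOKMARK events, which carry no object change to handle.
func consumeWithBookmarks(stream watch.Interface, handle func(watch.Event)) (lastRV string) {
	for ev := range stream.ResultChan() {
		if obj, err := meta.Accessor(ev.Object); err == nil {
			lastRV = obj.GetResourceVersion()
		}
		if ev.Type == watch.Bookmark {
			continue // nothing to handle; only the RV matters
		}
		handle(ev)
	}
	return lastRV
}
```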

If it were a complete fix, it’d be a different story. Instead, it makes the problem subtler.

I don’t follow that… it fixes the invariant that watch events appear in an order such that their resourceVersion is valid to restart the watch

I’d hope we agree that WATCH should return events in monotonically increasing RV order, and that it would be best if, when no RV is specified, we just started watching from “now”. That’s unfortunately not the case right now, and none of the proposed changes are backwards compatible. So we are left with two choices to make, one short-term and one long-term.

Short term: There is an API call that breaks the watch invariant, returns invalid data, and can cause losing events, skipping them, and hot loops. We don’t use it in informers or in well-designed paths in our system, but it is still exposed like that to the rest of the ecosystem, and people there didn’t possess list+watch knowledge when they wrote the code calling it. Deprecation won’t solve it; that’s the long term. We can’t change the type of data we send, for backwards compatibility, but we can sort the data so it maintains the watch invariants. As advised, those calls should have been using list+watch anyway (they can switch at any time; see the sketch below), and the fix for both the watch cache and the etcd read is 2 LOC + tests. Given all those reasons, why wouldn’t we just trade a bit of performance for valid data? Much better than returning invalid data in a shorter time.
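For reference, the list+watch pattern those callers should switch to might look like this (a sketch against client-go’s typed Pod client; the callback names are illustrative):

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// listThenWatch shows the recommended replacement for WATCH from RV "0":
// LIST once for a consistent snapshot, then WATCH from the list's RV.
// Every event after that point carries an RV usable as a restart token.
func listThenWatch(ctx context.Context, cs kubernetes.Interface, ns string,
	onPod func(corev1.Pod), onEvent func(watch.Event)) error {

	pods, err := cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		onPod(p) // consistent snapshot at pods.ResourceVersion
	}

	w, err := cs.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{
		ResourceVersion: pods.ResourceVersion, // start strictly after the snapshot
	})
	if err != nil {
		return err
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		onEvent(ev)
	}
	return nil
}
```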

Long term: Watch shouldn’t combine LIST+WATCH calls in one method, not even with a sentinel. Watch has a different set of invariants than list. I can see that it makes it easy to write Until loops that also exercise the condition on the initial state, but the same can be done on the client side with ListWatchUntil (a sketch follows). I guess the separation also makes things clearer w.r.t. RBAC. We should deprecate the old list behaviour in watch with RV 0.
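A hand-rolled sketch of that client-side equivalent (a hypothetical helper, not the real ListWatchUntil signature): evaluate the condition over the listed snapshot first, then over the live stream:

```go
package main

import (
	"context"
	"errors"

	"k8s.io/apimachinery/pkg/watch"
)

// cond reports whether an event satisfies the caller's condition.
type cond func(watch.Event) (done bool, err error)

// untilWithInitialState assumes initial holds events synthesized from a
// prior LIST and stream is a WATCH started at the list's RV, as in the
// earlier list-then-watch sketch.
func untilWithInitialState(ctx context.Context, initial []watch.Event, stream watch.Interface, c cond) error {
	for _, ev := range initial {
		if done, err := c(ev); err != nil || done {
			return err
		}
	}
	defer stream.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case ev, ok := <-stream.ResultChan():
			if !ok {
				return errors.New("watch closed before condition was met")
			}
			if done, err := c(ev); err != nil || done {
				return err
			}
		}
	}
}
```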

I think we all agree that we want to change that list behaviour and deprecate it in the future, but we also need to fix the invalid data we return NOW.

Also, technically, how do we deprecate a subset of method arguments for WATCH before we start returning an error code there? What timeline are we talking about?

We have not offered that guarantee in the past.

I couldn’t find any disclaimer in the API that you lose ordering based on the RV value given to WATCH.

The general RV ordering is implied by the etcd documentation:

Watches are long-running requests and use gRPC streams to stream event data. A watch stream is bi-directional; the client writes to the stream to establish watches and reads to receive watch event. A single watch stream can multiplex many distinct watches by tagging events with per-watch identifiers. This multiplexing helps reducing the memory footprint and connection overhead on the core etcd cluster.

Watches make three guarantees about events:

  • Ordered - events are ordered by revision; an event will never appear on a watch if it precedes an event in time that has already been posted.
  • Reliable - a sequence of events will never drop any subsequence of events; if there are events ordered in time as a < b < c, then if the watch receives events a and c, it is guaranteed to receive b.
  • Atomic - a list of events is guaranteed to encompass complete revisions; updates in the same revision over multiple keys will not be split over several lists of events.