core: cr-syncer does not retry on failed updates
We find quite a number of these errors in our cr-syncer logs:
2021/01/26 16:40:16 Syncing key "default/1710.2053438" from queue "downstream" failed: WarehouseOrder default/1710.2053438 @ 184236994: update status failed: Operation cannot be fulfilled on warehouseorders.ewm.sap.com "1710.2053438": the object has been modified; please apply your changes to the latest version and try again
cr-syncer does not retry applying the change as the error message suggests. If there are other update events from the downstream cluster, it will get back in sync. But if there are no further change events from downstream for that CR, it remains out of sync. Worse, a CR in this state no longer syncs changes from the upstream cluster either. It remains in this inconsistent state until cr-syncer is restarted.
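For reference, client-go ships a standard helper for exactly this situation: on a 409 conflict, re-fetch the object to pick up the latest resourceVersion and re-apply the change. A minimal sketch of what a retrying status update could look like; the function name and the `gvr`/`desiredStatus` parameters are placeholders for illustration, not actual cr-syncer code:

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/util/retry"
)

// updateStatusWithRetry re-fetches the CR and re-applies the status on every
// conflict, instead of giving up after the first failed update.
func updateStatusWithRetry(ctx context.Context, client dynamic.Interface,
	gvr schema.GroupVersionResource, namespace, name string,
	desiredStatus map[string]interface{}) error {

	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Get the latest resourceVersion; the local copy is stale after a conflict.
		current, err := client.Resource(gvr).Namespace(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		current.Object["status"] = desiredStatus
		_, err = client.Resource(gvr).Namespace(namespace).UpdateStatus(ctx, current, metav1.UpdateOptions{})
		return err // RetryOnConflict retries only if this is a 409 conflict.
	})
}
```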
About this issue
- State: closed
- Created 3 years ago
- Comments: 21 (21 by maintainers)
Commits related to this issue
- Enable HTTP/2 connection health checking This limits the effect of a connection dropout by detecting it and reconnecting after ~30s. This is added to Kubernetes in v1.19.4, but I think we need to swi... — committed to drigz/core by drigz 3 years ago
- Update to ingress-nginx v0.44.0 The newer version will gracefully close the connection to the cr-syncer when the config reloads, so this goes some way towards fixing #64 (although the cr-syncer shoul... — committed to drigz/core by drigz 3 years ago
- Update to ingress-nginx v0.44.0 The newer version will gracefully close the connection to the cr-syncer when the config reloads, so this goes some way towards fixing #64 (although the cr-syncer shoul... — committed to googlecloudrobotics/core by drigz 3 years ago
Good news, I think I got to the bottom of this. curl behaves differently because it is only doing one streaming request. When two or more streaming requests are going through a single HTTP/2 connection, the ingress shuts down differently: it immediately sends GOAWAY (Go seems to ignore this) and then drops the connection (without a TLS close-notify or TCP FIN) ten seconds later. 30s later, Go's TCP keepalive notices this, but that doesn't seem to propagate through the http2 layer, and so client-go doesn't see it and restart the watch.
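For context, HTTP/2 connection health checking (which the first commit above enables) is Go's opt-in fix for exactly this: the client sends a PING frame after the connection has been idle for ReadIdleTimeout and closes it if no reply arrives within PingTimeout, so a silently dropped connection is detected even without a FIN. A rough sketch of how a client transport can opt in via golang.org/x/net/http2; the function name and timeout values are illustrative, not the ones in the actual change:

```go
import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// configureHTTP2HealthChecks enables HTTP/2 PING-based health checking on an
// existing *http.Transport, so a dead connection is torn down after roughly
// ReadIdleTimeout + PingTimeout instead of hanging indefinitely.
func configureHTTP2HealthChecks(t *http.Transport) error {
	t2, err := http2.ConfigureTransports(t)
	if err != nil {
		return err
	}
	// Send a PING if no frames are received for 30s; close the connection
	// if the PING isn't answered within 15s.
	t2.ReadIdleTimeout = 30 * time.Second
	t2.PingTimeout = 15 * time.Second
	return nil
}
```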
Upgrading nginx is sufficient to improve the behavior, although that won’t help when the connection is dropped for real due to lost connectivity.
I’ll send the following changes for internal review (see the related commits above):
- Enable HTTP/2 connection health checking, so a dropped connection is detected and reestablished after ~30s.
- Update to ingress-nginx v0.44.0, which gracefully closes the connection to the cr-syncer when the config reloads.