core: cr-syncer does not retry on failed updates

We find quite a number of these errors in our cr-syncer logs.

2021/01/26 16:40:16 Syncing key "default/1710.2053438" from queue "downstream" failed: WarehouseOrder default/1710.2053438 @ 184236994: update status failed: Operation cannot be fulfilled on warehouseorders.ewm.sap.com "1710.2053438": the object has been modified; please apply your changes to the latest version and try again

cr-syncer does not retry applying the change, although the error message suggests it should. If other update events arrive from the downstream cluster, the CR gets back in sync. But if there are no more change events from downstream for that CR, it remains out of sync. While a CR is in this state, it does not even sync changes from the upstream cluster anymore. It stays in this inconsistent state until cr-syncer is restarted.
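
For illustration, here is a minimal, standard-library-only sketch of what retrying on a conflict could look like. The names `store`, `object`, `errConflict` and `syncStatus` are hypothetical stand-ins, not cr-syncer code. A real implementation would more likely use `retry.RetryOnConflict` from `k8s.io/client-go/util/retry` and detect the error with `apierrors.IsConflict`. The idea is to re-read the latest version after a conflict and reapply the status change instead of giving up:

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the API server's 409 Conflict
// ("the object has been modified"). Hypothetical name.
var errConflict = errors.New("conflict: the object has been modified")

// object is a hypothetical, minimal stand-in for a custom resource.
type object struct {
	ResourceVersion int
	Status          string
}

// store is a hypothetical in-memory stand-in for the cluster that
// receives the status update.
type store struct {
	current object
}

// get returns a copy of the latest stored object.
func (s *store) get() object { return s.current }

// updateStatus uses optimistic concurrency: the write is rejected with
// errConflict unless the caller's ResourceVersion matches the stored one.
func (s *store) updateStatus(o object) error {
	if o.ResourceVersion != s.current.ResourceVersion {
		return errConflict
	}
	s.current.Status = o.Status
	s.current.ResourceVersion++
	return nil
}

// syncStatus tries to write status starting from a possibly stale cached
// copy. On a conflict it fetches the latest version and tries again, up
// to maxAttempts times in total.
func syncStatus(s *store, cached object, status string, maxAttempts int) error {
	obj := cached
	for attempt := 0; attempt < maxAttempts; attempt++ {
		obj.Status = status
		err := s.updateStatus(obj)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err // not a conflict: retrying will not help
		}
		obj = s.get() // refresh to the latest version before retrying
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, errConflict)
}
```

Calling `syncStatus` with a cached copy whose `ResourceVersion` is behind the stored one fails on the first attempt and succeeds on the second, provided `maxAttempts` is at least 2. Requeueing the key with a backoff would be an alternative to retrying inline.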

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 21 (21 by maintainers)

Most upvoted comments

Good news, I think I got to the bottom of this. curl behaves differently because it is only doing one streaming request. When two or more streaming requests are going through a single HTTP/2 connection, the ingress shuts down differently: it immediately sends GOAWAY (Go seems to ignore this) and then drops the connection (without a TLS close-notify or TCP FIN) ten seconds later. 30s later, Go’s TCP keepalive notices this, but that doesn’t seem to propagate through the http2 layer, so client-go doesn’t see it and doesn’t restart the watch.

Upgrading nginx is sufficient to improve the behavior, although that won’t help when the connection is dropped for real due to lost connectivity.
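
As an illustration of the general problem (a stream that goes silent without an error), here is a minimal, standard-library-only sketch of an idle watchdog around an event stream. It is not the change proposed in this issue. `consume`, `errIdle` and the string event type are hypothetical, and the sketch assumes the caller restarts the watch whenever `errIdle` is returned:

```go
package main

import (
	"errors"
	"time"
)

// errIdle reports that nothing arrived on the stream within the idle
// limit, which may mean the underlying connection is dead. Hypothetical name.
var errIdle = errors.New("watch idle: no events within the timeout")

// consume passes events to handle until the stream is closed, done is
// signalled, or no event arrives for longer than idle.
func consume(events <-chan string, done <-chan struct{}, idle time.Duration, handle func(string)) error {
	timer := time.NewTimer(idle)
	defer timer.Stop()
	for {
		select {
		case <-done:
			return nil // caller asked us to stop
		case ev, ok := <-events:
			if !ok {
				return nil // stream closed cleanly
			}
			handle(ev)
			// Restart the idle countdown, draining a tick that may
			// have fired while the event was being handled.
			if !timer.Stop() {
				select {
				case <-timer.C:
				default:
				}
			}
			timer.Reset(idle)
		case <-timer.C:
			return errIdle // silent for too long: caller should re-watch
		}
	}
}
```

A watch can legitimately be quiet for a long time, so an idle limit like this only makes sense if the server sends periodic traffic or the limit is generous. Health-checking the connection itself, for example with HTTP/2 pings via the `ReadIdleTimeout` setting of the `golang.org/x/net/http2` transport, is another option.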

I’ll send the following changes for internal review: