etcd: gRPC code Unavailable instead of Canceled

After this PR (https://github.com/etcd-io/etcd/pull/9178), etcd running with the parameter --log-package-levels=debug logs the following:

...
etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 61; CANCEL")
etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 78; CANCEL")
...

In my Prometheus I see:

sum by(grpc_type, grpc_service, grpc_code) (grpc_server_handled_total{grpc_code="Unavailable", job="etcd", grpc_type="bidi_stream"}) = 4172

But it seems to me that this is not an error and should not fall under the Unavailable code:

Unavailable indicates the service is currently unavailable. This is a most likely a transient condition and may be corrected by retrying with a backoff. See litmus test above for deciding between FailedPrecondition, Aborted, and Unavailable.

It looks more like the Canceled code:

Canceled indicates the operation was canceled (typically by the caller).

Now everyone who uses Prometheus Operator + Alertmanager gets this alert, because CANCEL falls under Unavailable.

Related: #9725, #9576, #9166, https://github.com/openshift/origin/issues/20311, https://github.com/coreos/prometheus-operator/issues/2046, and in Google Groups.
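To illustrate why the status code matters for alerting, here is a minimal sketch in Go. It does not use the real gRPC or Prometheus libraries: `Code`, `handledCounter`, and `failureRatio` are hypothetical stand-ins, and the sample counts are made up. It also assumes an alert that ignores OK and Canceled and treats every other code as a failure, which may not match the actual etcd alert rules.

```go
package main

import "fmt"

// Code is a local stand-in for a gRPC status code name
// (not the real codes.Code type).
type Code string

const (
	OK          Code = "OK"
	Canceled    Code = "Canceled"
	Unavailable Code = "Unavailable"
)

// handledCounter mimics a counter such as grpc_server_handled_total,
// keyed by status code only.
type handledCounter map[Code]int

// failureRatio returns the share of handled RPCs whose code counts as a
// failure. Assumption: OK and Canceled are not failures, everything else is.
func failureRatio(c handledCounter) float64 {
	total, failed := 0, 0
	for code, n := range c {
		total += n
		if code != OK && code != Canceled {
			failed += n
		}
	}
	if total == 0 {
		return 0
	}
	return float64(failed) / float64(total)
}

func main() {
	// The same 10 client-side watch cancellations, recorded under two codes.
	asUnavailable := handledCounter{OK: 90, Unavailable: 10}
	asCanceled := handledCounter{OK: 90, Canceled: 10}

	fmt.Printf("recorded as Unavailable: %.2f\n", failureRatio(asUnavailable))
	fmt.Printf("recorded as Canceled:    %.2f\n", failureRatio(asCanceled))
}
```

With these made-up counts, the same client behavior yields a 10% failure ratio when recorded as Unavailable and 0% when recorded as Canceled, which is the difference between an alert firing and staying quiet.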

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 13
  • Comments: 29 (18 by maintainers)

Most upvoted comments

Any news on merging fixes?!

This could solve some ongoing issues with Prometheus monitoring. Will this PR ever be merged? 😃

I have found the source of Unavailable instead of Canceled.

https://github.com/etcd-io/etcd/blob/master/etcdserver/api/v3rpc/watch.go#L198-L201

When the watch is canceled, we pass this error:

https://github.com/etcd-io/etcd/blob/63dd73c1869f1784f907b922f61571176a2802e8/etcdserver/api/v3rpc/rpctypes/error.go#L66

This is not valid and results in codes.Unavailable being passed to status.New(), which leads to the metrics reporting Unavailable. In my local testing, adding a new error that uses codes.Canceled allows the metrics to report properly.

I will raise a PR in a bit to resolve.
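The idea in the comment above can be sketched roughly as follows. This is not the actual etcd code: `Code`, `statusErr`, and `codeOf` are stand-ins so the example runs without the gRPC module, and the error names and message strings are placeholders. What it shows is that the code chosen when the error value is created is what a metrics layer later reads back.

```go
package main

import (
	"errors"
	"fmt"
)

// Code is a local stand-in for a gRPC status code.
type Code int

const (
	OK Code = iota
	Canceled
	Unknown
	Unavailable
)

func (c Code) String() string {
	switch c {
	case OK:
		return "OK"
	case Canceled:
		return "Canceled"
	case Unavailable:
		return "Unavailable"
	}
	return "Unknown"
}

// statusErr is a stand-in for an error created from a status code and message.
type statusErr struct {
	code Code
	msg  string
}

func (e *statusErr) Error() string {
	return fmt.Sprintf("rpc error: code = %s desc = %s", e.code, e.msg)
}

// codeOf reads the code back from an error, the way a metrics
// interceptor would. Errors without a code are reported as Unknown.
func codeOf(err error) Code {
	if err == nil {
		return OK
	}
	var se *statusErr
	if errors.As(err, &se) {
		return se.code
	}
	return Unknown
}

var (
	// Before: the cancel path reuses an existing error declared as Unavailable
	// (placeholder name and message).
	errReused = &statusErr{Unavailable, "placeholder: existing server error"}
	// After: a dedicated error declared as Canceled (placeholder name and message).
	errWatchCanceled = &statusErr{Canceled, "placeholder: watch canceled"}
)

func main() {
	fmt.Println("before:", codeOf(errReused))
	fmt.Println("after: ", codeOf(errWatchCanceled))
}
```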

@hexfusion Is there any progress on this issue? Certain etcd3 alert rules are being triggered because of it: https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/etcd3_alert.rules.yml#L32

Thanks for digging, @spzala. I am going to dig into this further.

Update: @menghanl was right about the source of the issue. The actual inbound error to mapRecvMsgError is:

rpc error: code = Unknown desc = client disconnected

I will connect the rest of the dots and come up with some resolution hopefully soon.

I am picking this back up now, hopefully for the last time to resolve.

I have thought about this for a while, and it does feel like there could be a middle ground between the Unknown and Unavailable error codes in this situation, for example by allowing Unknown to conditionally remain as the error code, similar to [1], instead of forcing Unavailable. If we look at the gRPC spec [2], a client-generated Unavailable seems sane, as opposed to a server-generated one. So perhaps this is not a matter of using the wrong error code but a result of gRPC having two server transports [3]? @menghanl, curious about your thoughts.

/cc @brancz

Some data transmitted (e.g., request metadata written to TCP connection) before connection breaks | UNAVAILABLE | Client
Server shutting down … | UNAVAILABLE | Server

[1] https://github.com/grpc/grpc-go/blob/66cd5249103cfe4638864f08462537504b07ff1d/internal/transport/handler_server.go#L426
[2] https://github.com/grpc/grpc/blob/master/doc/statuscodes.md
[3] https://github.com/grpc/grpc-go/issues/1819#issuecomment-360648206 (won’t fix)
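One way to picture the "middle ground" floated in this comment is the sketch below. It is purely illustrative and is not the grpc-go or etcd implementation: the `inbound` type and the `mapRecvError` function are hypothetical, and matching on the description text "client disconnected" is an assumption taken from the error quoted earlier in the thread.

```go
package main

import (
	"fmt"
	"strings"
)

// inbound is a stand-in for a received stream error: a status code name
// plus its description, as printed in "rpc error: code = X desc = Y".
type inbound struct {
	code string
	desc string
}

// mapRecvError sketches the conditional mapping: an Unknown error whose
// description says the client went away keeps its code, while every other
// receive error is still forced to Unavailable.
// The substring check is an assumption made for illustration.
func mapRecvError(in inbound) inbound {
	if in.code == "Unknown" && strings.Contains(in.desc, "client disconnected") {
		return in
	}
	return inbound{code: "Unavailable", desc: in.desc}
}

func main() {
	for _, in := range []inbound{
		{"Unknown", "client disconnected"},
		{"Unknown", "connection reset by peer"},
	} {
		out := mapRecvError(in)
		fmt.Printf("%s (%s) -> %s\n", in.code, in.desc, out.code)
	}
}
```

Matching on error text is fragile, so a real change would more likely key off a typed error or the stream context.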

@hexfusion I sure will, thanks!

Quick update: I had to spend some time creating a cluster with TLS, but I have it set up now, so I should be able to provide more updates soon. Thanks!

@spzala could you please take a look?

Yes, @hexfusion, thanks for notifying me. I have no free time to work on it right now. Maybe someone else can help?

@Arslanbekov I will try to take a look soon.