etcd: gRPC code Unavailable instead of Canceled
After this PR https://github.com/etcd-io/etcd/pull/9178 we get the following in etcd running with the parameter --log-package-levels=debug:
...
etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 61; CANCEL")
etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 78; CANCEL")
...
In my Prometheus I see:
sum by(grpc_type, grpc_service, grpc_code) (grpc_server_handled_total{grpc_code="Unavailable", job="etcd", grpc_type="bidi_stream"}) = 4172
But it seems to me that this is not an error, and it should not fall under the Unavailable code:
Unavailable indicates the service is currently unavailable. This is most likely a transient condition and may be corrected by retrying with a backoff. See litmus test above for deciding between FailedPrecondition, Aborted, and Unavailable.
It looks more like the Canceled code:
Canceled indicates the operation was canceled (typically by the caller).
Now everyone who uses prometheus-operator + Alertmanager gets this alert, because CANCEL falls under Unavailable.
Related: #9725, #9576, #9166, https://github.com/openshift/origin/issues/20311, https://github.com/coreos/prometheus-operator/issues/2046, and reports in Google Groups.
About this issue
- State: closed
- Created 6 years ago
- Reactions: 13
- Comments: 29 (18 by maintainers)
Commits related to this issue
- manifests: Remove etcd gRPC calls failed alerts These alerts are firing constantly due to some issues around how etcd increases its metrics. See https://github.com/etcd-io/etcd/issues/10289 — committed to brancz/cluster-monitoring-operator by brancz 5 years ago
- Stop logging client disconnect as a pilot error Right now when Envoy shuts down it sometimes sends an Unavailable error code instead of Cancelled - see https://github.com/etcd-io/etcd/issues/10289 fo... — committed to howardjohn/istio by howardjohn 4 years ago
- Stop logging client disconnect as a pilot error (#19882) * Stop logging client disconnect as a pilot error Right now when Envoy shuts down it sometimes sends an Unavailable error code instead of Can... — committed to istio/istio by howardjohn 4 years ago
- etcdserver: fix incorrect metrics generated when clients cancel watches Before this patch, a client which cancels the context for a watch results in the server generating a `rpctypes.ErrGRPCNoLeader`... — committed to ironcladlou/etcd by ironcladlou 4 years ago
Any news on merging fixes?!
This could solve some ongoing issues with Prometheus monitoring; will this PR ever be merged? 😃
I have found the source of Unavailable instead of Canceled:
https://github.com/etcd-io/etcd/blob/master/etcdserver/api/v3rpc/watch.go#L198-L201
When the watch is canceled we pass the error
https://github.com/etcd-io/etcd/blob/63dd73c1869f1784f907b922f61571176a2802e8/etcdserver/api/v3rpc/rpctypes/error.go#L66
which is not valid, resulting in codes.Unavailable being passed to status.New() and the metrics reporting Unavailable. In my local testing, adding a new error using codes.Canceled allows the metrics to report properly. I will raise a PR in a bit to resolve this.
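A minimal Go sketch of the idea described above. The watchstatus package and the isClientCtxErr/watchSendError helpers are hypothetical names for illustration, not etcd's actual code; the sketch only shows returning codes.Canceled for a client-initiated teardown instead of the blanket Unavailable mapping.

```go
// Package watchstatus is a hypothetical illustration of mapping watch-stream
// send errors to a gRPC status, so client cancels are not counted as Unavailable.
package watchstatus

import (
	"context"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isClientCtxErr reports whether the stream failed because the client went away
// (canceled the watch or hit its deadline) rather than because of a server fault.
func isClientCtxErr(ctxErr, sendErr error) bool {
	if ctxErr == context.Canceled || ctxErr == context.DeadlineExceeded {
		return true
	}
	if s, ok := status.FromError(sendErr); ok {
		return s.Code() == codes.Canceled || s.Code() == codes.DeadlineExceeded
	}
	return false
}

// watchSendError picks the gRPC status returned from the watch handler; this is
// what the Prometheus interceptor ultimately records in grpc_server_handled_total.
func watchSendError(ctx context.Context, sendErr error) error {
	if isClientCtxErr(ctx.Err(), sendErr) {
		// Client-initiated teardown: report Canceled so alert rules keyed on
		// Unavailable stay quiet.
		return status.Error(codes.Canceled, "watch stream canceled by client")
	}
	// Anything else is still treated as a server-side failure.
	return status.Error(codes.Unavailable, fmt.Sprintf("failed to send watch response: %v", sendErr))
}
```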
@hexfusion Is there any progress on this issue? Certain etcd3 alert rules are being triggered because of it: https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/etcd3_alert.rules.yml#L32
Thanks for digging, @spzala. I am going to dig into this further.
Update: @menghanl was right about the source of the issue; the actual inbound error to mapRecvMsgError is:
rpc error: code = Unknown desc = client disconnected
I will connect the rest of the dots and come up with some resolution hopefully soon.
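For context, a small illustrative sketch of what classifying that inbound error could look like. looksLikeClientDisconnect is a hypothetical helper; the Unknown code plus "client disconnected" message signature is taken from the error quoted above and is the only signal available, which is why the message text has to be inspected at all.

```go
// Hypothetical helper illustrating why the inbound error is awkward to classify:
// its code is Unknown, so only the message text indicates a client disconnect.
package watchstatus

import (
	"strings"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// looksLikeClientDisconnect returns true for errors shaped like
// "rpc error: code = Unknown desc = client disconnected".
func looksLikeClientDisconnect(err error) bool {
	s, ok := status.FromError(err)
	if !ok {
		return false
	}
	return s.Code() == codes.Unknown && strings.Contains(s.Message(), "client disconnected")
}
```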
I am picking this back up now, hopefully for the last time to resolve.
I have thought about this for a while, and it does feel like there could be a middle ground between the Unknown and Unavailable error codes in this situation, for example by allowing Unknown to conditionally remain as the error code, similar to [1], rather than forcing Unavailable. If we look at the gRPC spec [2], a client-generated Unavailable seems sane, versus one generated by the server. So perhaps this is not a matter of using the wrong error code but a result of gRPC having two server transports [3]? @menghanl curious about your thoughts? /cc @brancz
[1] https://github.com/grpc/grpc-go/blob/66cd5249103cfe4638864f08462537504b07ff1d/internal/transport/handler_server.go#L426
[2] https://github.com/grpc/grpc/blob/master/doc/statuscodes.md
[3] https://github.com/grpc/grpc-go/issues/1819#issuecomment-360648206 (won’t fix)
Quick update: I had to spend some time creating a cluster with TLS, but I have it created now, so I should be providing more updates soon. Thanks!
@hexfusion I sure will, thanks!
@spzala could you please take a look?
@Arslanbekov I will try to take a look soon.