etcd: gRPC code Unavailable instead of Canceled
After this PR https://github.com/etcd-io/etcd/pull/9178 we get the following in etcd running with the parameter --log-package-levels=debug:
...
etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 61; CANCEL")
etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 78; CANCEL")
...
In my Prometheus I see:
sum by(grpc_type, grpc_service, grpc_code) (grpc_server_handled_total{grpc_code="Unavailable", job="etcd", grpc_type="bidi_stream"}) = 4172
But it seems to me that this is not an error, and it should not fall under the Unavailable code:
Unavailable indicates the service is currently unavailable. This is most likely a transient condition and may be corrected by retrying with a backoff. See litmus test above for deciding between FailedPrecondition, Aborted, and Unavailable.
It looks more like the Canceled code:
Canceled indicates the operation was canceled (typically by the caller).
Now everyone who uses prometheus-operator + Alertmanager gets this alert, because CANCEL falls under Unavailable.
Related: #9725, #9576, #9166, https://github.com/openshift/origin/issues/20311, https://github.com/coreos/prometheus-operator/issues/2046, and reports in Google Groups.
About this issue
- State: closed
- Created 6 years ago
- Reactions: 13
- Comments: 29 (18 by maintainers)
Commits related to this issue
- manifests: Remove etcd gRPC calls failed alerts These alerts are firing constantly due to some issues around how etcd increases its metrics. See https://github.com/etcd-io/etcd/issues/10289 — committed to brancz/cluster-monitoring-operator by brancz 5 years ago
- Stop logging client disconnect as a pilot error Right now when Envoy shuts down it sometimes sends an Unavailable error code instead of Cancelled - see https://github.com/etcd-io/etcd/issues/10289 fo... — committed to howardjohn/istio by howardjohn 4 years ago
- Stop logging client disconnect as a pilot error (#19882) * Stop logging client disconnect as a pilot error Right now when Envoy shuts down it sometimes sends an Unavailable error code instead of Can... — committed to istio/istio by howardjohn 4 years ago
- etcdserver: fix incorrect metrics generated when clients cancel watches Before this patch, a client which cancels the context for a watch results in the server generating a `rpctypes.ErrGRPCNoLeader`... — committed to ironcladlou/etcd by ironcladlou 4 years ago
Any news on merging fixes?!
This could solve some ongoing issues with Prometheus monitoring; will this PR ever be merged? 😃
I have found the source of Unavailable instead of Canceled:
https://github.com/etcd-io/etcd/blob/master/etcdserver/api/v3rpc/watch.go#L198-L201
When the watch is canceled we pass the error
https://github.com/etcd-io/etcd/blob/63dd73c1869f1784f907b922f61571176a2802e8/etcdserver/api/v3rpc/rpctypes/error.go#L66
which is not valid, resulting in codes.Unavailable being passed to status.New() and the metrics reporting Unavailable. In my local testing, adding a new error using codes.Canceled allows the metrics to report properly. I will raise a PR in a bit to resolve this.
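A minimal Go sketch of the idea described above. The watchstatus package and the isClientCtxErr/watchSendError helpers are hypothetical names for illustration, not etcd's actual code; the sketch only shows returning codes.Canceled for a client-initiated teardown instead of the blanket Unavailable mapping.

```go
// Package watchstatus is a hypothetical illustration of mapping watch-stream
// send errors to a gRPC status, so client cancels are not counted as Unavailable.
package watchstatus

import (
	"context"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isClientCtxErr reports whether the stream failed because the client went away
// (canceled the watch or hit its deadline) rather than because of a server fault.
func isClientCtxErr(ctxErr, sendErr error) bool {
	if ctxErr == context.Canceled || ctxErr == context.DeadlineExceeded {
		return true
	}
	if s, ok := status.FromError(sendErr); ok {
		return s.Code() == codes.Canceled || s.Code() == codes.DeadlineExceeded
	}
	return false
}

// watchSendError picks the gRPC status returned from the watch handler; this is
// what the Prometheus interceptor ultimately records in grpc_server_handled_total.
func watchSendError(ctx context.Context, sendErr error) error {
	if isClientCtxErr(ctx.Err(), sendErr) {
		// Client-initiated teardown: report Canceled so alert rules keyed on
		// Unavailable stay quiet.
		return status.Error(codes.Canceled, "watch stream canceled by client")
	}
	// Anything else is still treated as a server-side failure.
	return status.Error(codes.Unavailable, fmt.Sprintf("failed to send watch response: %v", sendErr))
}
```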
@hexfusion Is there any progress on this issue? Certain etcd3 alert rules are being triggered because of it: https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/etcd3_alert.rules.yml#L32
Thanks for digging, @spzala. I am going to dig into this further.
Update: @menghanl was right about the source of the issue; the actual inbound error to mapRecvMsgError is:
rpc error: code = Unknown desc = client disconnected
I will connect the rest of the dots and come up with some resolution hopefully soon.
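For context, a small illustrative sketch of what classifying that inbound error could look like. looksLikeClientDisconnect is a hypothetical helper; the Unknown code plus "client disconnected" message signature is taken from the error quoted above and is the only signal available, which is why the message text has to be inspected at all.

```go
// Hypothetical helper illustrating why the inbound error is awkward to classify:
// its code is Unknown, so only the message text indicates a client disconnect.
package watchstatus

import (
	"strings"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// looksLikeClientDisconnect returns true for errors shaped like
// "rpc error: code = Unknown desc = client disconnected".
func looksLikeClientDisconnect(err error) bool {
	s, ok := status.FromError(err)
	if !ok {
		return false
	}
	return s.Code() == codes.Unknown && strings.Contains(s.Message(), "client disconnected")
}
```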
I am picking this back up now, hopefully for the last time to resolve.
I have thought about this for a while, and it does feel like there could be a middle ground between the Unknown and Unavailable error codes in this situation, for example by allowing Unknown to conditionally remain as the error code, similar to [1], rather than forcing Unavailable. If we look at the gRPC spec [2], a client-generated Unavailable seems sane, versus one generated by the server. So perhaps this is not a matter of using the wrong error code but a result of gRPC having two server transports [3]? @menghanl curious about your thoughts? /cc @brancz
[1] https://github.com/grpc/grpc-go/blob/66cd5249103cfe4638864f08462537504b07ff1d/internal/transport/handler_server.go#L426
[2] https://github.com/grpc/grpc/blob/master/doc/statuscodes.md
[3] https://github.com/grpc/grpc-go/issues/1819#issuecomment-360648206 (won’t fix)
Quick update: I had to spend some time creating a cluster with TLS, but I have it created now, so I should be providing more updates soon. Thanks!
@hexfusion I sure will, thanks!
@spzala could you please take a look?
@Arslanbekov I will try to take a look soon.