istio: 503 NR during istiod rollout restart

(NOTE: This is used to report product bugs. To report a security vulnerability, please visit https://istio.io/about/security-vulnerabilities. To ask questions about how to use Istio, please visit https://discuss.istio.io.)

Bug description

We are using Istio 1.7.0 in our production environment. When we perform a rollout restart of istiod, all proxies are disconnected and then reconnect to the newly created istiod pod. During this process we see 503 NR errors whenever there is a steady stream of requests from a proxy to other services within the mesh. From the Envoy log, we can see that the errors occur at a certain point in the Envoy cluster warm-up phase.

  • [ ] Docs
  • [ ] Installation
  • [X] Networking
  • [ ] Performance and Scalability
  • [ ] Extensions and Telemetry
  • [ ] Security
  • [ ] Test and Release
  • [ ] User Experience
  • [ ] Developer Infrastructure
  • [X] Upgrade

Expected behavior

No 503 NR errors during the istiod rollout restart process.

Steps to reproduce the bug

  1. Constantly send requests to a service in the mesh from a proxy. For example:

while true; do date +"%T" && kubectl exec -it -n default httpbin-777c6f578c-jx4fc -c httpbin -- sh -c "curl -X POST -I http://xx-api.default/api/yy"; done;

  2. Rollout restart istiod; this triggers the proxy to disconnect and reconnect to istiod:

kubectl rollout restart deployment/istiod -n istio-system

  3. Watch the logs of the proxy sending the requests:

kubectl logs -n default -f httpbin-777c6f578c-jx4fc -c istio-proxy --tail=10
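Optionally, the proxy's xDS sync state can also be watched during the restart; a minimal check using istioctl (assuming istioctl is available on the client and the same pod name as above):

watch -n 1 'istioctl proxy-status | grep httpbin-777c6f578c-jx4fc'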

From the request log we can see some 503 NR errors like this:

{"protocol":"HTTP/1.1","upstream_service_time":"-","upstream_local_address":"-","duration":"0","upstream_transport_failure_reason":"-","route_name":"-","downstream_local_address":"...:80","user_agent":"curl/7.64.0","response_code":"503","response_flags":"NR","start_time":"2020-12-11T08:51:31.889Z","method":"POST","request_id":"3a39b6e3-cab6-4414-89da-ac4c06530a4e","upstream_host":"-","x_forwarded_for":"-","requested_server_name":"-","bytes_received":"0","istio_policy_status":"-","bytes_sent":"0","upstream_cluster":"-","downstream_remote_address":"10.28.97.29:59884","authority":"xx-api.default","path":"/api/yy"}

From the debug log we can see a "cluster not found" error like this:

2020-12-11T08:51:32.455166Z debug envoy conn_handler [C107665] new connection
2020-12-11T08:51:32.455196Z debug envoy http [C107665] new stream
2020-12-11T08:51:32.455226Z debug envoy http [C107665][S14709698978255946193] request headers complete (end_stream=true): ':authority', 'xx-api.default' ':path', '/api/yy' ':method', 'POST' 'user-agent', 'curl/7.64.0' 'accept', '*/*'
2020-12-11T08:51:32.455230Z debug envoy http [C107665][S14709698978255946193] request end stream
2020-12-11T08:51:32.455295Z debug envoy filter cannot find cluster outbound|80|normal|xx-api.default.svc.cluster.local
2020-12-11T08:51:32.455310Z debug envoy router [C107665][S14709698978255946193] unknown cluster 'outbound|80|normal|xx-api.default.svc.cluster.local'
2020-12-11T08:51:32.455316Z debug envoy http [C107665][S14709698978255946193] Sending local reply with details cluster_not_found
2020-12-11T08:51:32.455337Z debug envoy http [C107665][S14709698978255946193] encoding headers via codec (end_stream=true): ':status', '503' 'date', 'Fri, 11 Dec 2020 08:51:31 GMT' 'server', 'envoy'

2020-12-11T08:51:32.455660Z debug envoy connection [C107665] remote close
2020-12-11T08:51:32.455666Z debug envoy connection [C107665] closing socket: 0
2020-12-11T08:51:32.455707Z debug envoy conn_handler [C107665] adding to cleanup list

We also see cluster warm-up messages in the log, and we notice that the 503 NR errors happen during a certain phase of the cluster warm-up process.

2020-12-11T08:51:24.760182Z debug envoy upstream add/update cluster outbound|80|normal|xx-api.default.svc.cluster.local starting warming
…
2020-12-11T08:51:32.927901Z debug envoy upstream warming cluster outbound|80|normal|xx-api.default.svc.cluster.local complete
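To line up the 503 NR responses with the warm-up window, the relevant entries can be pulled out of the proxy log; a rough sketch, assuming the log is saved locally as proxy.log (the file name is only an example, and the warming messages only appear when the proxy log level is set to debug):

# save the full proxy log (pod name from the repro above)
kubectl logs -n default httpbin-777c6f578c-jx4fc -c istio-proxy > proxy.log
# count requests rejected with the NR response flag
grep -c '"response_flags":"NR"' proxy.log
# show when clusters started and finished warming
grep -E 'upstream (add/update|warming) cluster' proxy.log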

Version (include the output of istioctl version --remote and kubectl version --short and helm version --short if you used Helm)

Istio version: 1.7.0
Kubernetes version: 1.16.3

How was Istio installed?

istioctl install

Environment where the bug was observed (cloud vendor, OS, etc)

Tencent Kubernetes Engine (TKE)

Additionally, please consider running istioctl bug-report and attach the generated cluster-state tarball to this issue. Refer to the cluster state archive for more details.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 24 (9 by maintainers)

Most upvoted comments

But the issue is clear: istiod is not sending the full config at the start:

1366:2020-12-15T06:55:24.483627Z        info    ads     CDS: PUSH for node:spring-traffic-server-normal-5fbb954ff-w6ltl.devops clusters:6 services:1 version:2020-12-15T06:55:20Z/0
1522:2020-12-15T06:55:47.900311Z        info    ads     CDS: PUSH for node:spring-traffic-server-normal-5fbb954ff-w6ltl.devops clusters:2910 services:1189 version:2020-12-15T06:55:31Z/1
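For reference, these push sizes come from the istiod log; they can be pulled out with something like the following (assuming the default istiod deployment in istio-system):

kubectl logs -n istio-system deploy/istiod | grep -n "CDS: PUSH"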

Since it's not easy for us to upgrade to the latest Istio 1.9, we finally figured out a workaround. We delay istiod's readiness, giving it enough time to warm up before pushing config, by setting its readinessProbe.initialDelaySeconds to 60 seconds:

readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /ready
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 60
  periodSeconds: 3
  successThreshold: 1
  timeoutSeconds: 5
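If istiod is managed through istioctl install rather than by patching the deployment directly, the same delay can likely be expressed in the IstioOperator spec; this is only a sketch, assuming the k8s.readinessProbe override on the pilot component is available in your Istio version:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    pilot:
      k8s:
        # merged into the istiod deployment's readiness probe;
        # the delay gives istiod time to warm up before it reports ready
        readinessProbe:
          initialDelaySeconds: 60
          periodSeconds: 3
          timeoutSeconds: 5

Applied with istioctl install -f <file>, this should have the same effect as editing the deployment by hand.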

Hope this helps.

Thank you, John. Here is the Envoy log: 503NR-envoy.log.gz

The requested service name is xx-api, and you can search for the path "/api/yy", which corresponds to my requests during the restart process. You can find one 503 NR error in the logs.

For your information, here are some key events I picked up from the log:

2020-12-14T02:29:23.036025Z debug envoy upstream add/update cluster outbound|80||xx-api.default.svc.cluster.local starting warming
2020-12-14T02:29:29.010731Z debug envoy conn_handler [C336079] new connection   // matches PassthroughCluster
2020-12-14T02:29:30.065693Z debug envoy conn_handler [C336084] new connection   // matches PassthroughCluster
2020-12-14T02:29:31.152977Z debug envoy conn_handler [C336086] new connection   // no match: 503 NR
2020-12-14T02:29:31.203611Z debug envoy upstream warming cluster outbound|80||xx-api.default.svc.cluster.local complete
2020-12-14T02:29:32.690389Z debug envoy conn_handler [C336088] new connection   // matches outbound|80|normal|xx-api.default.svc.cluster.local