istio: 503 NR when rollout and restart istiod
Bug description
We are using Istio 1.7.0 in our production environment. When we perform a rolling restart of istiod, all proxies disconnect and then reconnect to the newly created istiod pod. During this process we see 503 NR errors whenever there is a steady stream of requests from a proxy to other services in the mesh. From the Envoy log, we can see the errors occur at a certain point in the Envoy cluster warm-up phase.
- [ ] Docs
- [ ] Installation
- [X] Networking
- [ ] Performance and Scalability
- [ ] Extensions and Telemetry
- [ ] Security
- [ ] Test and Release
- [ ] User Experience
- [ ] Developer Infrastructure
- [X] Upgrade
Expected behavior
No 503 NR errors during istiod rollout and restart process
Steps to reproduce the bug
- Constantly send requests to a service in the mesh from a proxied pod. For example:

```shell
while true; do date +"%T" && kubectl exec -it -n default httpbin-777c6f578c-jx4fc -c httpbin -- sh -c "curl -X POST -I http://xx-api.default/api/yy"; done
```
- Rollout-restart istiod; this should trigger the proxies to disconnect and reconnect to istiod:

```shell
kubectl rollout restart deployment/istiod -n istio-system
```
- Watch the logs of the proxy sending the requests:

```shell
kubectl logs -n default -f httpbin-777c6f578c-jx4fc -c istio-proxy --tail=10
```
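To isolate the failing requests from the noise, the access log can be filtered on the response flags. A small sketch (the pod name comes from the reproduction steps above; the sample JSON lines below are simplified stand-ins for the real access-log entries):

```shell
# Sketch: keep only 503/NR entries from the sidecar's JSON access log.
# Live usage would pipe the proxy log through the same grep, e.g.:
#   kubectl logs -n default httpbin-777c6f578c-jx4fc -c istio-proxy | grep '"response_flags":"NR"'
# Demonstration on two simplified sample lines:
printf '%s\n' \
  '{"response_code":"200","response_flags":"-"}' \
  '{"response_code":"503","response_flags":"NR"}' |
  grep '"response_flags":"NR"'
```

Only the NR line survives the filter, so during the istiod restart you can count exactly how many requests hit the warm-up window.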
From the request log we can see some 503 NR errors like this:
```
{"protocol":"HTTP/1.1","upstream_service_time":"-","upstream_local_address":"-","duration":"0","upstream_transport_failure_reason":"-","route_name":"-","downstream_local_address":"...:80","user_agent":"curl/7.64.0","response_code":"503","response_flags":"NR","start_time":"2020-12-11T08:51:31.889Z","method":"POST","request_id":"3a39b6e3-cab6-4414-89da-ac4c06530a4e","upstream_host":"-","x_forwarded_for":"-","requested_server_name":"-","bytes_received":"0","istio_policy_status":"-","bytes_sent":"0","upstream_cluster":"-","downstream_remote_address":"10.28.97.29:59884","authority":"xx-api.default","path":"/api/yy"}
```
From the debug log we can see a cluster-not-found error like this:
```
2020-12-11T08:51:32.455166Z debug envoy conn_handler [C107665] new connection
2020-12-11T08:51:32.455196Z debug envoy http [C107665] new stream
2020-12-11T08:51:32.455226Z debug envoy http [C107665][S14709698978255946193] request headers complete (end_stream=true):
  ':authority', 'xx-api.default'
  ':path', '/api/yy'
  ':method', 'POST'
  'user-agent', 'curl/7.64.0'
  'accept', '*/*'
2020-12-11T08:51:32.455230Z debug envoy http [C107665][S14709698978255946193] request end stream
2020-12-11T08:51:32.455295Z debug envoy filter cannot find cluster outbound|80|normal|xx-api.default.svc.cluster.local
2020-12-11T08:51:32.455310Z debug envoy router [C107665][S14709698978255946193] unknown cluster 'outbound|80|normal|xx-api.default.svc.cluster.local'
2020-12-11T08:51:32.455316Z debug envoy http [C107665][S14709698978255946193] Sending local reply with details cluster_not_found
2020-12-11T08:51:32.455337Z debug envoy http [C107665][S14709698978255946193] encoding headers via codec (end_stream=true):
  ':status', '503'
  'date', 'Fri, 11 Dec 2020 08:51:31 GMT'
  'server', 'envoy'
2020-12-11T08:51:32.455660Z debug envoy connection [C107665] remote close
2020-12-11T08:51:32.455666Z debug envoy connection [C107665] closing socket: 0
2020-12-11T08:51:32.455707Z debug envoy conn_handler [C107665] adding to cleanup list
```
We also see cluster warm-up info in the log, and notice that the 503 NR errors happen during a certain phase of the cluster warm-up process.
```
2020-12-11T08:51:24.760182Z debug envoy upstream add/update cluster outbound|80|normal|xx-api.default.svc.cluster.local starting warming
...
2020-12-11T08:51:32.927901Z debug envoy upstream warming cluster outbound|80|normal|xx-api.default.svc.cluster.local complete
```
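The width of this warm-up window can be pulled out of the debug log by pairing the "starting warming" and "warming ... complete" lines. A small sketch, using the two sample lines from the excerpt above (in a live cluster they would come from the istio-proxy debug log instead):

```shell
# Sketch: extract the warming start/complete timestamps for the cluster
# to see how wide the 503/NR window is. The two sample lines are copied
# from the debug-log excerpt in this report.
log='2020-12-11T08:51:24.760182Z debug envoy upstream add/update cluster outbound|80|normal|xx-api.default.svc.cluster.local starting warming
2020-12-11T08:51:32.927901Z debug envoy upstream warming cluster outbound|80|normal|xx-api.default.svc.cluster.local complete'

echo "$log" | awk '/starting warming/           {print "warm-up started:  " $1}
                   /warming cluster .* complete/ {print "warm-up complete: " $1}'
```

Here the gap is roughly eight seconds, which matches the window in which the 503 NR responses appear.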
Version (include the output of `istioctl version --remote` and `kubectl version --short`, and `helm version --short` if you used Helm)

Istio version: 1.7.0
Kubernetes version: 1.16.3
How was Istio installed?
istioctl install
Environment where the bug was observed (cloud vendor, OS, etc)
Tencent Kubernetes Engine (TKE)
Additionally, please consider running `istioctl bug-report` and attaching the generated cluster-state tarball to this issue.
Refer to the cluster-state archive for more details.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 24 (9 by maintainers)
Since it's not easy to upgrade to the latest Istio 1.9, we finally figured out a workaround. We delay istiod reporting healthy, giving it enough time to warm up before pushing config, by setting its readinessProbe.initialDelaySeconds to 60 seconds:
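The actual snippet did not survive in this copy of the comment; a minimal sketch of the change, assuming the default istiod deployment in `istio-system` (the probe path and port shown are istiod's standard readiness endpoint, but should be checked against your deployment):

```yaml
# Sketch only: delay istiod readiness so Envoy clusters have time to warm up
# before proxies reconnect and receive config pushes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: istiod
  namespace: istio-system
spec:
  template:
    spec:
      containers:
        - name: discovery
          readinessProbe:
            initialDelaySeconds: 60   # the workaround: was effectively ~1s by default
            httpGet:
              path: /ready
              port: 8080
```

With this, the old istiod pod keeps serving proxies for an extra minute while the new pod warms up, so proxies reconnect to an instance whose clusters are already warm.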
Hope this helps.
Thank you, John. Here is the Envoy log: 503NR-envoy.log.gz
The requested service name is xx-api, and you can search for the path "/api/yy", which corresponds to my requests during the restart process. You can find one 503 NR error in the logs.
For your information, here are some key events I picked up from the log: