istio: Connection returns TLS / Certificate verification error in proxy after enabling SDS
Hi, I am having a problem with Istio in my current production setup and would need your help troubleshooting it.
Bug Description
Background
I am running Istio 1.1.7 in all our environments on Kubernetes (Amazon EKS) 1.12.7, with mTLS enabled on the application namespace and SDS in both the ingress gateway and the sidecars. There is no circuit breaker and no custom root CA for Citadel.
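As a sanity check when this happens, the mTLS status between a proxy and an upstream service can be cross-checked with istioctl; a minimal sketch using the gateway pod from the logs below (authn tls-check is the 1.1-era subcommand, and its exact arguments vary between Istio versions):
# a CONFLICT in the output would also explain 503 UF responses
istioctl authn tls-check istio-ingressgateway-76749b4bb4-z6n78.istio-system \
    frontend.services.svc.cluster.local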
Problem
The behaviour I saw: at first, all services in the cluster work fine, and connections from the ingress controller hit the services and return correctly.
But after a while (days or weeks; I haven’t been able to find the pattern), all connections from the ingress to the services return 503 with UF,URX. There are logs in the istio-proxy container of the ingress pod, but no logs in the upstream service’s istio-proxy container.
An example log (sorry for the format, I pulled it out of Elasticsearch):
"stream_name": "istio-ingressgateway-76749b4bb4-z6n78",
"istio_policy_status": "-",
"bytes_sent": "91",
"upstream_cluster": "outbound|8080||frontend.services.svc.cluster.local",
"downstream_remote_address": "172.23.24.174:30690",
"path": "/user",
"authority": "prod.example.com",
"protocol": "HTTP/1.1",
"upstream_service_time": "-",
"upstream_local_address": "-",
"duration": "69",
"downstream_local_address": "172.23.24.189:443",
"response_code": "503",
"user_agent": "Mozilla/5.0 (Linux; Android 8.0.0) ...",
"response_flags": "UF,URX",
"start_time": "2019-06-03T13:26:06.617Z",
"method": "GET",
"request_id": "320037db-601b-9c52-861f-bwoeifwoiegi",
"upstream_host": "172.23.24.143:80",
"x_forwarded_for": "218.186.146.112,172.23.24.174",
"requested_server_name": "prod.example.com",
"bytes_received": "0",
I enabled debug logging for connections in the proxy sidecar with:
curl -XPOST localhost:15000/logging?connection=debug
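(For the ingress gateway pod, the same toggle can also be run from outside the pod; a sketch, assuming the stock proxyv2 image, which ships curl:)
# the container name istio-proxy is the default for both sidecars and gateways
kubectl -n istio-system exec istio-ingressgateway-76749b4bb4-z6n78 -c istio-proxy -- \
    curl -s -XPOST "localhost:15000/logging?connection=debug"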
With that enabled, I found this in the istio-proxy container of the ingress controller:
[2019-05-21 08:18:36.878][33][debug][connection] [external/envoy/source/common/network/connection_impl.cc:644] [C79846] connecting to 172.23.14.229:80
[2019-05-21 08:18:36.878][33][debug][connection] [external/envoy/source/common/network/connection_impl.cc:517] [C79846] connected
[2019-05-21 08:18:36.878][33][debug][connection] [external/envoy/source/common/network/connection_impl.cc:653] [C79846] connection in progress
[2019-05-21 08:18:36.878][33][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C79846] handshake error: 2
[2019-05-21 08:18:36.883][33][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C79846] handshake error: 2
[2019-05-21 08:18:36.883][33][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C79846] handshake error: 2
[2019-05-21 08:18:36.885][33][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C79846] handshake error: 1
[2019-05-21 08:18:36.885][33][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:175] [C79846] TLS error: 268436501:SSL routines:OPENSSL_internal:SSLV3_ALERT_CERTIFICATE_EXPIRED
[2019-05-21 08:18:36.885][33][debug][connection] [external/envoy/source/common/network/connection_impl.cc:183] [C79846] closing socket: 0
So it looks like there is some problem with the TLS certificates. The certs in istio-ca-secret and istio.istio-ingressgateway-service-account look correct and are not expired yet. The same goes for the internal certificates of my upstream services.
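For reference, here is roughly how the expiry can be checked, both for the certs in the Kubernetes secrets and for the ones the proxy is actually holding in memory (secret and key names assume the stock self-signed Citadel layout; /certs is the Envoy admin endpoint):
# expiry of the Citadel CA cert and the gateway's service-account cert
kubectl -n istio-system get secret istio-ca-secret \
    -o jsonpath='{.data.ca-cert\.pem}' | base64 --decode | openssl x509 -noout -enddate
kubectl -n istio-system get secret istio.istio-ingressgateway-service-account \
    -o jsonpath='{.data.cert-chain\.pem}' | base64 --decode | openssl x509 -noout -enddate
# certs Envoy has actually loaded (output includes days_until_expiration)
kubectl -n istio-system exec istio-ingressgateway-76749b4bb4-z6n78 -c istio-proxy -- \
    curl -s localhost:15000/certs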
As far as I can tell, this only happens after the service pods have been running for a few days without being restarted or redeployed with a new version.
I also saw another instance of the problem, but these logs were found inside the upstream service’s istio-proxy container, and the TLS error is different from the one in the ingress controller:
[2019-06-04 01:18:58.029][32][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C400] handshake error: 2
[2019-06-04 01:18:58.029][32][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C400] handshake error: 2
[2019-06-04 01:18:58.031][32][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C400] handshake error: 1
[2019-06-04 01:18:58.031][32][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:175] [C400] TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED
[2019-06-04 01:18:58.031][32][debug][connection] [external/envoy/source/common/network/connection_impl.cc:183] [C400] closing socket: 0
I am not sure what actually happened here; the Citadel logs, node agent logs, and the rest looked normal at that point in time.
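For reference, this is the kind of command used to pull those control-plane logs around the failure window (workload names assume a default 1.1 install: istio-citadel Deployment and istio-nodeagent DaemonSet):
kubectl -n istio-system logs deploy/istio-citadel --since=24h
kubectl -n istio-system logs ds/istio-nodeagent --since=24h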
Please let me know if there are any other logs/config you need to troubleshoot the problem.
Affected product area (please put an X in all that apply)
- [ ] Configuration Infrastructure
- [ ] Docs
- [ ] Installation
- [X] Networking
- [ ] Performance and Scalability
- [ ] Policies and Telemetry
- [X] Security
- [ ] Test and Release
- [ ] User Experience
- [ ] Developer Infrastructure
Steps to reproduce the bug
As mentioned previously, we still don’t know exactly how to reproduce it, but these are the patterns we have observed so far:
- Enable SDS
- Deploy applications and let them run for 2 - 4 days without any re-deployment / bouncing
- Observe 503 errors in the proxy logs (one way to watch for them is sketched below)
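One way to watch for the failure signature (assumes the gateway writes its access log to stdout and carries the stock app=istio-ingressgateway label):
kubectl -n istio-system logs -l app=istio-ingressgateway -c istio-proxy -f \
    | grep 'UF,URX'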
Version (include the output of istioctl version --remote and kubectl version)
Istio 1.1.7 in all our environments
Kubernetes (AWS EKS) 1.12.7
How was Istio installed? We installed it using Helm, via Tiller
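For reference, the kind of Helm values involved in enabling SDS on 1.1 (value names per the upstream 1.1 charts; this is a sketch rather than our exact command, and the node agent additionally needs its image and CA environment values set):
helm upgrade istio install/kubernetes/helm/istio \
    --namespace istio-system \
    --set global.mtls.enabled=true \
    --set global.sds.enabled=true \
    --set global.sds.udsPath="unix:/var/run/sds/uds_path" \
    --set nodeagent.enabled=true \
    --set gateways.istio-ingressgateway.sds.enabled=true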
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 19 (6 by maintainers)
Faced the same issue with Istio 1.9; it seems there is an issue in the ingress gateway. After I forcibly killed its pod (it restarted after termination), the issue disappeared.
Faced the same issue; a restart of the ingress pods and deployments fixed it. This requires more attention and investigation from the development side, as it is clearly a recurring issue among users.
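A minimal sketch of that restart workaround, assuming the default istio-ingressgateway Deployment in istio-system (kubectl rollout restart requires kubectl 1.15+):
kubectl -n istio-system rollout restart deployment istio-ingressgateway
kubectl -n <app-namespace> rollout restart deployment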
We’re running into a similar issue when running in k3d. Note: this is a vanilla installation out of the box via istioctl install --skip-confirmation. Running istioctl proxy-config secret <pod> for all pods showed valid certificates.
Versions:
Workaround is to cycle the ingress gateway as others have noted: