istio: Connection returns TLS / Certificate verification error in proxy after enabling SDS

Hi, I am having a problem with Istio in my current production setup and need your help troubleshooting it.

Bug Description

Background

I am running Istio 1.1.7 in all our environments, on Kubernetes (Amazon EKS) 1.12.7, with mTLS enabled on the application namespace and SDS enabled for both the ingress gateway and the sidecars. There are no circuit breakers and no custom root CA for Citadel.

Problem

The behaviour I saw is that, at first, all services in the cluster work fine: connections from the ingress gateway reach the services and return correctly.

But after a while (days or weeks; I have not been able to find a pattern), all connections from the ingress gateway to the services return 503 with UF,URX response flags. There are logs in the istio-proxy container of the ingress gateway pod, but no corresponding logs in the upstream service's istio-proxy container.

An example access log entry (sorry for the format; I pulled it out of Elasticsearch):

"stream_name": "istio-ingressgateway-76749b4bb4-z6n78",
"istio_policy_status": "-",
"bytes_sent": "91",
"upstream_cluster": "outbound|8080||frontend.services.svc.cluster.local",
"downstream_remote_address": "172.23.24.174:30690",
"path": "/user",
"authority": "prod.example.com",
"protocol": "HTTP/1.1",
"upstream_service_time": "-",
"upstream_local_address": "-",
"duration": "69",
"downstream_local_address": "172.23.24.189:443",
"response_code": "503",
"user_agent": "Mozilla/5.0 (Linux; Android 8.0.0) ...",
"response_flags": "UF,URX",
"start_time": "2019-06-03T13:26:06.617Z",
"method": "GET",
"request_id": "320037db-601b-9c52-861f-bwoeifwoiegi",
"upstream_host": "172.23.24.143:80",
"x_forwarded_for": "218.186.146.112,172.23.24.174",
"requested_server_name": "prod.example.com",
"bytes_received": "0",

I tried to enable connection-level debug logging on the proxy with:

curl -XPOST localhost:15000/logging?connection=debug
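
For completeness, I ran it from inside the ingress gateway pod's istio-proxy container, roughly like this (using the gateway pod name from the logs above):

kubectl -n istio-system exec istio-ingressgateway-76749b4bb4-z6n78 -c istio-proxy -- \
  curl -XPOST 'localhost:15000/logging?connection=debug'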

Then I found this in the istio-proxy container of the ingress gateway:

[2019-05-21 08:18:36.878][33][debug][connection] [external/envoy/source/common/network/connection_impl.cc:644] [C79846] connecting to 172.23.14.229:80
[2019-05-21 08:18:36.878][33][debug][connection] [external/envoy/source/common/network/connection_impl.cc:517] [C79846] connected
[2019-05-21 08:18:36.878][33][debug][connection] [external/envoy/source/common/network/connection_impl.cc:653] [C79846] connection in progress
[2019-05-21 08:18:36.878][33][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C79846] handshake error: 2
[2019-05-21 08:18:36.883][33][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C79846] handshake error: 2
[2019-05-21 08:18:36.883][33][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C79846] handshake error: 2
[2019-05-21 08:18:36.885][33][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C79846] handshake error: 1
[2019-05-21 08:18:36.885][33][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:175] [C79846] TLS error: 268436501:SSL routines:OPENSSL_internal:SSLV3_ALERT_CERTIFICATE_EXPIRED
[2019-05-21 08:18:36.885][33][debug][connection] [external/envoy/source/common/network/connection_impl.cc:183] [C79846] closing socket: 0

So it looks like there is some problem with the TLS certificate. The certificates in istio-ca-secret and istio.istio-ingressgateway-service-account look correct and are not expired yet. The same goes for the internal certificates of my upstream services.
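
For reference, this is roughly how I checked the expiry dates (the key names assume the default Citadel-managed secrets):

# Citadel's self-signed CA certificate
kubectl -n istio-system get secret istio-ca-secret \
  -o jsonpath='{.data.ca-cert\.pem}' | base64 -d | openssl x509 -noout -dates

# Ingress gateway workload certificate issued by Citadel
kubectl -n istio-system get secret istio.istio-ingressgateway-service-account \
  -o jsonpath='{.data.cert-chain\.pem}' | base64 -d | openssl x509 -noout -dates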

As far as I can tell, this only happens when the service pods have been running for a few days without being restarted or redeployed with a new version.

I also saw another instance of the problem; this time the logs were inside the upstream service's istio-proxy container, and the TLS error is different from the one in the ingress gateway:

[2019-06-04 01:18:58.029][32][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C400] handshake error: 2
[2019-06-04 01:18:58.029][32][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C400] handshake error: 2
[2019-06-04 01:18:58.031][32][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:142] [C400] handshake error: 1
[2019-06-04 01:18:58.031][32][debug][connection] [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:175] [C400] TLS error: 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED
[2019-06-04 01:18:58.031][32][debug][connection] [external/envoy/source/common/network/connection_impl.cc:183] [C400] closing socket: 0

I am not sure what actually happened here; the Citadel logs, node agent logs, and everything else looked normal at that point in time.
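
In case it helps whoever looks into this, one more thing that can be checked is the certificate the proxy is actually serving (i.e. what it received over SDS), via the Envoy admin endpoint, and compared against the secrets above. Something like this, where the pod name is a placeholder:

kubectl exec <upstream-service-pod> -c istio-proxy -- curl -s localhost:15000/certs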

Please let me know if there are any other logs/config you need to troubleshoot the problem.

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[X] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[X] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Steps to reproduce the bug

As mentioned previously, we still don't know exactly how to reproduce it, but these are the patterns we have observed so far:

  • Enable SDS
  • Deploy applications and let them run for 2 - 4 days without any re-deployment / bouncing
  • Observe 503 errors in the proxy logs (see the example command after this list)
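
For example, this is roughly the command we use to spot them (assuming the default istio=ingressgateway label on the gateway pods):

kubectl -n istio-system logs -l istio=ingressgateway -c istio-proxy --tail=1000 | grep 'UF,URX'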

Version (include the output of istioctl version --remote and kubectl version)

Istio 1.1.7 in all our environments
Kubernetes (AWS EKS) 1.12.7

How was Istio installed? We installed it using Helm, via Tiller.
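
For reference, the SDS-related parts of our Helm values looked roughly like this (reconstructed from memory, so treat it as an approximation of the standard 1.1 chart options rather than our exact overrides):

global:
  mtls:
    enabled: true
  sds:
    enabled: true
    udsPath: "unix:/var/run/sds/uds_path"
nodeagent:
  enabled: true
  image: node-agent-k8s
  env:
    CA_PROVIDER: Citadel
    CA_ADDR: istio-citadel:8060
gateways:
  istio-ingressgateway:
    sds:
      enabled: true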

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 19 (6 by maintainers)

Most upvoted comments

Faced the same issue with Istio 1.9; there seems to be an issue in the ingress gateway. After I forcibly killed its pod (it restarted after termination), the issue disappeared.

Faced the same issue; a restart of the ingress pods and the application deployments fixed it. This requires more attention and investigation from the development side, as it is clearly a recurring issue among users.
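
For anyone searching for the exact commands, the restart amounts to something like this (deployment and namespace names depend on your install):

kubectl -n istio-system rollout restart deployment istio-ingressgateway
kubectl -n <your-app-namespace> rollout restart deployment <your-app-deployment>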

We’re running into a similar issue when running in k3d:

HTTP/1.1 503 Service Unavailable
content-length: 201
content-type: text/plain
date: Fri, 15 Oct 2021 20:02:30 GMT
server: istio-envoy

upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268436501:SSL routines:OPENSSL_internal:SSLV3_ALERT_CERTIFICATE_EXPIRED

Note: this is a vanilla installation out of the box via istioctl install --skip-confirmation

Running istioctl proxy-config secret <pod> for all pods showed valid certificates.
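
(That check was essentially a loop like the following; the namespace is just an example:)

for pod in $(kubectl -n default get pods -o name); do
  istioctl proxy-config secret "${pod#pod/}" -n default
done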

Versions:

  • Istio v1.11.3
  • Kubernetes v1.22.1

The workaround is to cycle the ingress gateway, as others have noted:

kubectl -n istio-system delete pod $(kubectl -n istio-system get pod -lapp=istio-ingressgateway -ojsonpath='{.items[0].metadata.name}')