istio: Randomly get 503 UR upstream_reset_before_response_started{remote_reset}

Is this the right place to submit this?

  • This is not a security vulnerability or a crashing bug
  • This is not a question about how to use Istio

Bug Description

We have been using Istio in production for two years. Lately an issue has been coming up more and more often.

What happens is that everything works fine, then at some point an istio-proxy container starts returning 503 UR upstream_reset_before_response_started{remote_reset}

After that the traffic is erratic, with 50-70% of calls failing, but only on the pod where the error happened. The other pods (we have at least 6 replicas per deployment) keep working correctly.

After a rollout restart everything works fine again until the next error, which can happen 20 minutes or 2 months later.

Here is an example where the error happened 3 times and we restarted the pod twice (at 17:00 and 17:20). We did not restart after the third occurrence in order to investigate, without success:

[Screenshot from 2023-09-19 12-35-53]

Here are the first 503 at each of these times and the last 200 before it:

First 503 pool:

[2023-09-18T14:26:34.978Z] "POST /services.Controller/List HTTP/2" 200 - via_upstream - "-" 100 40332 67 66 "x.x.x.x,x.x.x.x,x.x.x.x" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36" "575ab642-e2c6-9706-830f-a5c6d2e9a042" "server" "x.x.x.x:port" inbound|port|| x.x.x.x:port x.x.x.x:port x.x.x.x:0 outbound_.80_._.server.svc.cluster.local default traceID=2b34ee377aa11c70d229d6f7bd9af2da

[2023-09-18T14:26:35.028Z] "POST /services.Controller/Version HTTP/2" 503 UR upstream_reset_before_response_started{remote_reset} - "-" 8 0 1 - "x.x.x.x,x.x.x.x,x.x.x.x" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36" "bb991028-c585-928b-a5e8-cf6297739272" "server" "x.x.x.x:port" inbound|port|| x.x.x.x:port x.x.x.x:port x.x.x.x:0 outbound_.80_._.server.svc.cluster.local default traceID=426881a1f60826c2d999896dbdaf5d7d
[2023-09-18T15:10:20.217Z] "POST /services.Controller/Retrieve HTTP/2" 200 - via_upstream - "-" 60 208 60 60 "10.40.168.188,x.x.x.x,10.42.254.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36" "ae374654-4c3f-9ce8-a826-2606df0583a8" "server" "10.42.255.59:port" inbound|port|| x.x.x.x:53209 x.x.x.x:port x.x.x.x:0 outbound_.80_._.server.svc.cluster.local default traceID=8cfb61442bf16a912eb340909aab91a0

[2023-09-18T15:10:20.374Z] "POST /services.Controller/Retrieve HTTP/2" 503 UR upstream_reset_before_response_started{remote_reset} - "-" 60 0 1 - "x.x.x.x,x.x.x.x,x.x.x.x" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36" "b086ec83-dad7-9462-af2a-7f27dc960ebf" "server" "10.42.255.59:port" inbound|port|| x.x.x.x:port 10.42.255.59:port x.x.x.x:0 outbound_.80_._.server.svc.cluster.local default traceID=142edfc9f46c46ab210862e120d4be5f
[2023-09-18T15:27:10.535Z] "HEAD / HTTP/2" 503 UR upstream_reset_before_response_started{remote_reset} - "-" 0 0 0 - "65.154.226.171,x.x.x.x,x.x.x.x" "-" "32d94e0d-4973-962d-98e4-c57d535e1a96" "server" "10.42.13.141:port" inbound|port|| x.x.x.x:49999 10.42.13.141:port x.x.x.x:0 outbound_.80_._.server.svc.cluster.local default traceID=00d815a008582b3ed6682676b3fb8a65

[2023-09-18T15:27:10.525Z] "POST /services.Controller/List HTTP/2" 200 - via_upstream - "-" 124 67308 117 116 "x.x.x.x,x.x.x.x,x.x.x.x" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36" "486e4c88-f3b2-9c8e-a70e-f90787f4b6f4" "server" "x.x.x.x:port" inbound|port|| x.x.x.x:36531 x.x.x.x:port x.x.x.x:0 outbound_.80_._.server.svc.cluster.local default traceID=9b89298f6c07194806d7551b0954184c
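
When the next 503 appears, we can also dump the affected sidecar's Envoy counters to check whether the resets are counted against the application upstream (rough sketch; the pod name is a placeholder):

# Dump Envoy stats from the broken sidecar and keep only reset-related counters
kubectl -n namespace-name exec <affected-pod> -c istio-proxy -- \
  pilot-agent request GET stats | grep -E 'rx_reset|tx_reset'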

This error happens in multiple namespaces, and it happens more and more over time. We receive more requests now with new products on our microservice architecture, but the operations/s when it works and when it does not are not very different; still, a high load could trigger something that breaks and is only fixed by a restart.

We are using Istio as a proxy for gRPC and gRPC-Web.

We need help to investigate further. We added logging to the gRPC server to check whether gRPC has an internal issue, but since it happens randomly across all our services it looks more like an Istio issue.

It may be related to what is said in this comment: https://github.com/envoyproxy/envoy/issues/14981#issuecomment-776079153 but the connection is managed by gRPC, not by our code, so we do not know how to dig further into it.

It could also come from a data transfer larger than expected.

Any help on how to get more information about why Istio stops working correctly at some point would be greatly appreciated.
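
One thing we plan to try the next time it happens is raising the sidecar's log level for the HTTP/2 and connection loggers on the affected pod only, then tailing its logs while the 503s occur (sketch; the pod name is a placeholder):

# Raise Envoy verbosity for the HTTP/2 and connection loggers on the broken sidecar only
istioctl proxy-config log <affected-pod>.namespace-name --level http2:debug,connection:debug

# Tail the sidecar logs while reproducing the failing calls
kubectl -n namespace-name logs <affected-pod> -c istio-proxy -f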

I have read some posts about tcpKeepalive and idleTimeout, but I don't see how they could trigger the issue on only some of our pods and not all of them.
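
For reference, this is the kind of DestinationRule those posts describe; we have not applied it, and the name and values below are only illustrative:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: namespace-name-back-grpc      # illustrative name, not applied in our cluster
  namespace: namespace-name
spec:
  host: namespace-name-back-grpc
  trafficPolicy:
    connectionPool:
      tcp:
        tcpKeepalive:
          time: 300s        # start TCP keepalive probes after 5 minutes idle
          interval: 30s     # probe every 30 seconds
          probes: 3
      http:
        idleTimeout: 300s   # drop idle upstream HTTP/2 connections after 5 minutes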

Version

$ istioctl version
client version: 1.19.0
control plane version: 1.17.5
data plane version: 1.17.5 (161 proxies)
$ kubectl version
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.4+rke2r1

Additional Information

Here is the link to the istioctl bug-report, since it weighs 26 MB: https://we.tl/t-MT1vJCO0qE

We have two gateways, two virtual services, and two services that point to the same deployment.

adrien@adrien-XPS:~/socotec$ kubectl -n namespace-name get gateways.networking.istio.io 
NAME                                        AGE
namespace-name-back-grpc       82d
namespace-name-back-grpc-web   82d
adrien@adrien-XPS:~/socotec$ kubectl -n namespace-name get services
NAME                                        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
namespace-name-back-grpc       ClusterIP   10.43.73.183    <none>        80/TCP    82d
namespace-name-back-grpc-web   ClusterIP   10.43.133.64    <none>        80/TCP    82d
adrien@adrien-XPS:~/socotec$ kubectl -n namespace-name get vs
NAME                                        GATEWAYS                                        HOSTS                                                            AGE
namespace-name-back-grpc       ["namespace-name-back-grpc"]       ["back-grpc.server"]       82d
namespace-name-back-grpc-web   ["namespace-name-back-grpc-web"]   ["back-grpc-web.server"]   82d

Here are the virtual services:

adrien@adrien-XPS:~/socotec$ kubectl -n namespace-name get vs namespace-name-back-grpc -o yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  annotations:
    meta.helm.sh/release-name: namespace-name
    meta.helm.sh/release-namespace: namespace-name
  creationTimestamp: "2023-06-28T16:39:59Z"
  generation: 2
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: namespace-name
    helm.sh/chart: chart-ng-3.7.0
    helm.toolkit.fluxcd.io/name: application
    helm.toolkit.fluxcd.io/namespace: namespace-name
  name: namespace-name-back-grpc
  namespace: namespace-name
  resourceVersion: "81065415"
  uid: 539a168c-11e5-489f-9619-4062d361ef8d
spec:
  gateways:
  - namespace-name-back-grpc
  hosts:
  - back-grpc.server
  http:
  - match:
    - uri:
        prefix: /
    route:
    - destination:
        host: namespace-name-back-grpc
        port:
          number: 80


adrien@adrien-XPS:~/socotec$ kubectl -n namespace-name get vs namespace-name-back-grpc-web -o yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  annotations:
    meta.helm.sh/release-name: namespace-name
    meta.helm.sh/release-namespace: namespace-name
  creationTimestamp: "2023-06-28T16:39:59Z"
  generation: 2
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: namespace-name
    helm.sh/chart: chart-ng-3.7.0
    helm.toolkit.fluxcd.io/name: application
    helm.toolkit.fluxcd.io/namespace: namespace-name
  name: namespace-name-back-grpc-web
  namespace: namespace-name
  resourceVersion: "81065426"
  uid: 4f2e9ae4-4c0a-48f0-8d77-93a9f156f54e
spec:
  gateways:
  - namespace-name-back-grpc-web
  hosts:
  - back-grpc-web.server
  http:
  - corsPolicy:
      allowCredentials: true
      allowHeaders:
      - grpc-timeout
      - content-type
      - keep-alive
      - user-agent
      - cache-control
      - content-type
      - content-transfer-encoding
      - custom-header-1
      - x-accept-content-transfer-encoding
      - x-accept-response-streaming
      - x-user-agent
      - x-grpc-web
      - filters
      - pagination
      - headers
      - traceparent
      allowMethods:
      - POST
      - GET
      - OPTIONS
      - PUT
      - DELETE
      allowOrigin:
      - '*'
      exposeHeaders:
      - custom-header-1
      - grpc-status
      - grpc-message
      - filters
      - pagination
      - headers
      maxAge: 1728s
    match:
    - uri:
        prefix: /
    route:
    - destination:
        host: namespace-name-back-grpc-web
        port:
          number: 80

Here are the services:

adrien@adrien-XPS:~/socotec$ kubectl -n namespace-name get services namespace-name-back-grpc -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: namespace-name
    meta.helm.sh/release-namespace: namespace-name
  creationTimestamp: "2023-06-28T16:39:59Z"
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: namespace-name
    helm.sh/chart: chart-ng-3.7.0
    helm.toolkit.fluxcd.io/name: application
    helm.toolkit.fluxcd.io/namespace: namespace-name
  name: namespace-name-back-grpc
  namespace: namespace-name
  resourceVersion: "81065206"
  uid: 84d611f6-9c0f-4fe3-9a96-2669eb56c716
spec:
  clusterIP: x.x.x.x
  clusterIPs:
  - x.x.x.x
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: grpc
    port: port0
    protocol: TCP
    targetPort: port1
  selector:
    app: namespace-name-grpc-api
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}


adrien@adrien-XPS:~/socotec$ kubectl -n namespace-name get services namespace-name-back-grpc-web -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: namespace-name
    meta.helm.sh/release-namespace: namespace-name
  creationTimestamp: "2023-06-28T16:39:59Z"
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: namespace-name
    helm.sh/chart: chart-ng-3.7.0
    helm.toolkit.fluxcd.io/name: application
    helm.toolkit.fluxcd.io/namespace: namespace-name
  name: namespace-name-back-grpc-web
  namespace: namespace-name
  resourceVersion: "81065221"
  uid: 5f601e92-21bf-4133-8613-d1572cfd139d
spec:
  clusterIP: x.x.x.x
  clusterIPs:
  - x.x.x.x
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: grpc-web
    port: port0
    protocol: TCP
    targetPort: port1
  selector:
    app: namespace-name-grpc-api
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

Here are the gateways:

adrien@adrien-XPS:~/socotec$ kubectl -n namespace-name get gateways.networking.istio.io namespace-name-back-grpc -o yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  annotations:
    meta.helm.sh/release-name: namespace-name
    meta.helm.sh/release-namespace: namespace-name
  creationTimestamp: "2023-06-28T16:39:59Z"
  generation: 2
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: namespace-name
    helm.sh/chart: chart-ng-3.7.0
    helm.toolkit.fluxcd.io/name: application
    helm.toolkit.fluxcd.io/namespace: namespace-name
  name: namespace-name-back-grpc
  namespace: namespace-name
  resourceVersion: "81065337"
  uid: 34b93768-61a1-4559-9ce6-333855361port08
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - back-grpc.application.server
    port:
      name: http
      number: port0
      protocol: HTTP2
  - hosts:
    - back-grpc.application.server
    port:
      name: https
      number: port3
      protocol: HTTPS
    tls:
      credentialName: namespace-name-back-grpc
      mode: SIMPLE


adrien@adrien-XPS:~/socotec$ kubectl -n namespace-name get gateways.networking.istio.io namespace-name-back-grpc-web -o yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  annotations:
    meta.helm.sh/release-name: namespace-name
    meta.helm.sh/release-namespace: namespace-name
  creationTimestamp: "2023-06-28T16:39:59Z"
  generation: 2
  labels:
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: namespace-name
    helm.sh/chart: chart-ng-3.7.0
    helm.toolkit.fluxcd.io/name: application
    helm.toolkit.fluxcd.io/namespace: namespace-name
  name: namespace-name-back-grpc-web
  namespace: namespace-name
  resourceVersion: "81065349"
  uid: 55d4bd3d-a7port0-4b1d-8cfb-6d128236074c
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - back-grpc-web.application.server
    port:
      name: http
      number: port0
      protocol: HTTP2
  - hosts:
    - back-grpc-web.application.server
    port:
      name: https
      number: port3
      protocol: HTTPS
    tls:
      credentialName: namespace-name-back-grpc-web
      mode: SIMPLE

About this issue

  • Original URL
  • State: closed
  • Created 9 months ago
  • Comments: 24 (11 by maintainers)

Most upvoted comments

Let’s wait to see any new findings

There are no logs on the server side. We added more verbosity to the gRPC server (see the sketch below) and are waiting for the bug to happen again. We will send the detailed logs when it does.
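
A sketch of how this kind of verbosity can be enabled through the standard gRPC debug environment variables (assumption: the exact mechanism in our deployment may differ):

# Container env fragment for the gRPC server deployment (sketch; "all" is the
# broad "log everything" form and may be narrowed to specific tracers later)
env:
  - name: GRPC_VERBOSITY
    value: DEBUG
  - name: GRPC_TRACE
    value: all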

For information, here is the gRPC keepalive documentation: https://grpc.io/docs/guides/keepalive/. We did not change the default settings for these parameters except in a single service: ("grpc.keepalive_time_ms", 60000).

But the issue also happens in services other than that one.

We may have changed some parameters on the front HAProxy to avoid timeouts, because we have long PDF-generation streaming requests that can last more than 3 minutes.

The issue also happens on small unary calls.