istio: Randomly get 503 UR upstream_reset_before_response_started{remote_reset}
Is this the right place to submit this?
- This is not a security vulnerability or a crashing bug
- This is not a question about how to use Istio
Bug Description
We have been using Istio in production for two years. Lately, an issue has been occurring more and more often.
Everything works fine until, at some point, an istio-proxy container starts returning 503 UR upstream_reset_before_response_started{remote_reset}.
After that the traffic is erratic: 50-70% of calls fail, but only on the pod where the error happened. The other pods (we have at least 6 replicas per deployment) keep working correctly.
After a rollout restart everything works fine again until the next error, which can happen 20 minutes or 2 months later.
Here is an example where the error happened 3 times and we restarted the pod twice (at 17:00 and 17:20). We did not restart after the last occurrence in order to investigate, but without success.
Here are the first 503 at each of these times, together with the last 200 just before it:
First 503 pool:
[2023-09-18T14:26:34.978Z] "POST /services.Controller/List HTTP/2" 200 - via_upstream - "-" 100 40332 67 66 "x.x.x.x,x.x.x.x,x.x.x.x" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36" "575ab642-e2c6-9706-830f-a5c6d2e9a042" "server" "x.x.x.x:port" inbound|port|| x.x.x.x:port x.x.x.x:port x.x.x.x:0 outbound_.80_._.server.svc.cluster.local default traceID=2b34ee377aa11c70d229d6f7bd9af2da
[2023-09-18T14:26:35.028Z] "POST /services.Controller/Version HTTP/2" 503 UR upstream_reset_before_response_started{remote_reset} - "-" 8 0 1 - "x.x.x.x,x.x.x.x,x.x.x.x" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36" "bb991028-c585-928b-a5e8-cf6297739272" "server" "x.x.x.x:port" inbound|port|| x.x.x.x:port x.x.x.x:port x.x.x.x:0 outbound_.80_._.server.svc.cluster.local default traceID=426881a1f60826c2d999896dbdaf5d7d
[2023-09-18T15:10:20.217Z] "POST /services.Controller/Retrieve HTTP/2" 200 - via_upstream - "-" 60 208 60 60 "10.40.168.188,x.x.x.x,10.42.254.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36" "ae374654-4c3f-9ce8-a826-2606df0583a8" "server" "10.42.255.59:port" inbound|port|| x.x.x.x:53209 x.x.x.x:port x.x.x.x:0 outbound_.80_._.server.svc.cluster.local default traceID=8cfb61442bf16a912eb340909aab91a0
[2023-09-18T15:10:20.374Z] "POST /services.Controller/Retrieve HTTP/2" 503 UR upstream_reset_before_response_started{remote_reset} - "-" 60 0 1 - "x.x.x.x,x.x.x.x,x.x.x.x" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36" "b086ec83-dad7-9462-af2a-7f27dc960ebf" "server" "10.42.255.59:port" inbound|port|| x.x.x.x:port 10.42.255.59:port x.x.x.x:0 outbound_.80_._.server.svc.cluster.local default traceID=142edfc9f46c46ab210862e120d4be5f
[2023-09-18T15:27:10.535Z] "HEAD / HTTP/2" 503 UR upstream_reset_before_response_started{remote_reset} - "-" 0 0 0 - "65.154.226.171,x.x.x.x,x.x.x.x" "-" "32d94e0d-4973-962d-98e4-c57d535e1a96" "server" "10.42.13.141:port" inbound|port|| x.x.x.x:49999 10.42.13.141:port x.x.x.x:0 outbound_.80_._.server.svc.cluster.local default traceID=00d815a008582b3ed6682676b3fb8a65
[2023-09-18T15:27:10.525Z] "POST /services.Controller/List HTTP/2" 200 - via_upstream - "-" 124 67308 117 116 "x.x.x.x,x.x.x.x,x.x.x.x" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36" "486e4c88-f3b2-9c8e-a70e-f90787f4b6f4" "server" "x.x.x.x:port" inbound|port|| x.x.x.x:36531 x.x.x.x:port x.x.x.x:0 outbound_.80_._.server.svc.cluster.local default traceID=9b89298f6c07194806d7551b0954184c
This error happens in multiple namespaces, and more and more often over time. We do receive more requests now, with new products on our microservice architecture, but the operations/s when things work and when they fail are not very different; still, a high load can apparently break something that is only fixed by a restart.
We are using Istio as a proxy for gRPC and gRPC-Web.
We need help to investigate further. We added logging to the gRPC server to check whether gRPC has an internal issue, but since this happens randomly across all our services it looks more like an Istio issue.
It may be related to what is said in this comment: https://github.com/envoyproxy/envoy/issues/14981#issuecomment-776079153 but the connection is managed by gRPC, not by our code, so we do not know how to dig further into it.
It could also come from a data transfer larger than expected.
Any help on how to get more information about why Istio stops working correctly at some point would be greatly appreciated.
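For reference, this is roughly what we plan to try capturing next time a pod degrades (a rough sketch; pod and namespace are placeholders, and the exact syntax may need adjusting for our Istio version):
# Raise Envoy logger verbosity on the affected pod only (placeholders, not our real names)
$ istioctl proxy-config log <pod-name>.<namespace> --level connection:debug,http:debug,http2:debug
# Follow the sidecar logs around the next 503 UR
$ kubectl -n <namespace> logs <pod-name> -c istio-proxy -f
# Dump Envoy stats from the sidecar and look at the reset counters for the inbound cluster
$ kubectl -n <namespace> exec <pod-name> -c istio-proxy -- pilot-agent request GET stats | grep -i reset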
I have read some posts about tcpKeepalive and idleTimeout, but I don't see how those could cause the issue on only some of our pods and not all of them.
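For reference, this is roughly where those settings would live if we were to try them, assuming a DestinationRule on the backend service (the resource name and all values below are illustrative only, nothing we currently apply):
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: namespace-name-back-grpc-keepalive   # hypothetical name
  namespace: namespace-name
spec:
  host: namespace-name-back-grpc
  trafficPolicy:
    connectionPool:
      tcp:
        tcpKeepalive:        # TCP keepalive probes on upstream connections
          time: 60s
          interval: 30s
          probes: 3
      http:
        idleTimeout: 300s    # close idle upstream connections after 5 minutes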
Version
$ istioctl version
client version: 1.19.0
control plane version: 1.17.5
data plane version: 1.17.5 (161 proxies)
$ kubectl version
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.4+rke2r1
Additional Information
Here is a link to the istioctl bug-report output, as it weighs 26 MB: https://we.tl/t-MT1vJCO0qE
We have two gateways, two virtual services, and two services that point to the same deployment.
adrien@adrien-XPS:~/socotec$ kubectl -n namespace-name get gateways.networking.istio.io
NAME AGE
namespace-name-back-grpc 82d
namespace-name-back-grpc-web 82d
adrien@adrien-XPS:~/socotec$ kubectl -n namespace-name get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
namespace-name-back-grpc ClusterIP 10.43.73.183 <none> 80/TCP 82d
namespace-name-back-grpc-web ClusterIP 10.43.133.64 <none> 80/TCP 82d
adrien@adrien-XPS:~/socotec$ kubectl -n namespace-name get vs
NAME GATEWAYS HOSTS AGE
namespace-name-back-grpc ["namespace-name-back-grpc"] ["back-grpc.server"] 82d
namespace-name-back-grpc-web ["namespace-name-back-grpc-web"] ["back-grpc-web.server"] 82d
Here are the virtual services:
adrien@adrien-XPS:~/socotec$ kubectl -n namespace-name get vs namespace-name-back-grpc -o yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
annotations:
meta.helm.sh/release-name: namespace-name
meta.helm.sh/release-namespace: namespace-name
creationTimestamp: "2023-06-28T16:39:59Z"
generation: 2
labels:
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: namespace-name
helm.sh/chart: chart-ng-3.7.0
helm.toolkit.fluxcd.io/name: application
helm.toolkit.fluxcd.io/namespace: namespace-name
name: namespace-name-back-grpc
namespace: namespace-name
resourceVersion: "81065415"
uid: 539a168c-11e5-489f-9619-4062d361ef8d
spec:
gateways:
- namespace-name-back-grpc
hosts:
- back-grpc.server
http:
- match:
- uri:
prefix: /
route:
- destination:
host: namespace-name-back-grpc
port:
number: 80
adrien@adrien-XPS:~/socotec$ kubectl -n namespace-name get vs namespace-name-back-grpc-web -o yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
annotations:
meta.helm.sh/release-name: namespace-name
meta.helm.sh/release-namespace: namespace-name
creationTimestamp: "2023-06-28T16:39:59Z"
generation: 2
labels:
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: namespace-name
helm.sh/chart: chart-ng-3.7.0
helm.toolkit.fluxcd.io/name: application
helm.toolkit.fluxcd.io/namespace: namespace-name
name: namespace-name-back-grpc-web
namespace: namespace-name
resourceVersion: "81065426"
uid: 4f2e9ae4-4c0a-48f0-8d77-93a9f156f54e
spec:
gateways:
- namespace-name-back-grpc-web
hosts:
- back-grpc-web.server
http:
- corsPolicy:
allowCredentials: true
allowHeaders:
- grpc-timeout
- content-type
- keep-alive
- user-agent
- cache-control
- content-type
- content-transfer-encoding
- custom-header-1
- x-accept-content-transfer-encoding
- x-accept-response-streaming
- x-user-agent
- x-grpc-web
- filters
- pagination
- headers
- traceparent
allowMethods:
- POST
- GET
- OPTIONS
- PUT
- DELETE
allowOrigin:
- '*'
exposeHeaders:
- custom-header-1
- grpc-status
- grpc-message
- filters
- pagination
- headers
maxAge: 1728s
match:
- uri:
prefix: /
route:
- destination:
host: namespace-name-back-grpc-web
port:
number: 80
Here are the services:
adrien@adrien-XPS:~/socotec$ kubectl -n namespace-name get services namespace-name-back-grpc -o yaml
apiVersion: v1
kind: Service
metadata:
annotations:
meta.helm.sh/release-name: namespace-name
meta.helm.sh/release-namespace: namespace-name
creationTimestamp: "2023-06-28T16:39:59Z"
labels:
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: namespace-name
helm.sh/chart: chart-ng-3.7.0
helm.toolkit.fluxcd.io/name: application
helm.toolkit.fluxcd.io/namespace: namespace-name
name: namespace-name-back-grpc
namespace: namespace-name
resourceVersion: "81065206"
uid: 84d611f6-9c0f-4fe3-9a96-2669eb56c716
spec:
clusterIP: x.x.x.x
clusterIPs:
- x.x.x.x
internalTrafficPolicy: Cluster
ipFamilies:
- IPv4
ipFamilyPolicy: SingleStack
ports:
- name: grpc
port: port0
protocol: TCP
targetPort: port1
selector:
app: namespace-name-grpc-api
sessionAffinity: None
type: ClusterIP
status:
loadBalancer: {}
adrien@adrien-XPS:~/socotec$ kubectl -n namespace-name get services namespace-name-back-grpc-web -o yaml
apiVersion: v1
kind: Service
metadata:
annotations:
meta.helm.sh/release-name: namespace-name
meta.helm.sh/release-namespace: namespace-name
creationTimestamp: "2023-06-28T16:39:59Z"
labels:
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: namespace-name
helm.sh/chart: chart-ng-3.7.0
helm.toolkit.fluxcd.io/name: application
helm.toolkit.fluxcd.io/namespace: namespace-name
name: namespace-name-back-grpc-web
namespace: namespace-name
resourceVersion: "81065221"
uid: 5f601e92-21bf-4133-8613-d1572cfd139d
spec:
clusterIP: x.x.x.x
clusterIPs:
- x.x.x.x
internalTrafficPolicy: Cluster
ipFamilies:
- IPv4
ipFamilyPolicy: SingleStack
ports:
- name: grpc-web
port: port0
protocol: TCP
targetPort: port1
selector:
app: namespace-name-grpc-api
sessionAffinity: None
type: ClusterIP
status:
loadBalancer: {}
Here are the gateways:
adrien@adrien-XPS:~/socotec$ kubectl -n namespace-name get gateways.networking.istio.io namespace-name-back-grpc -o yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
annotations:
meta.helm.sh/release-name: namespace-name
meta.helm.sh/release-namespace: namespace-name
creationTimestamp: "2023-06-28T16:39:59Z"
generation: 2
labels:
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: namespace-name
helm.sh/chart: chart-ng-3.7.0
helm.toolkit.fluxcd.io/name: application
helm.toolkit.fluxcd.io/namespace: namespace-name
name: namespace-name-back-grpc
namespace: namespace-name
resourceVersion: "81065337"
uid: 34b93768-61a1-4559-9ce6-333855361port08
spec:
selector:
istio: ingressgateway
servers:
- hosts:
- back-grpc.application.server
port:
name: http
number: port0
protocol: HTTP2
- hosts:
- back-grpc.application.server
port:
name: https
number: port3
protocol: HTTPS
tls:
credentialName: namespace-name-back-grpc
mode: SIMPLE
adrien@adrien-XPS:~/socotec$ kubectl -n namespace-name get gateways.networking.istio.io namespace-name-back-grpc-web -o yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
annotations:
meta.helm.sh/release-name: namespace-name
meta.helm.sh/release-namespace: namespace-name
creationTimestamp: "2023-06-28T16:39:59Z"
generation: 2
labels:
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: namespace-name
helm.sh/chart: chart-ng-3.7.0
helm.toolkit.fluxcd.io/name: application
helm.toolkit.fluxcd.io/namespace: namespace-name
name: namespace-name-back-grpc-web
namespace: namespace-name
resourceVersion: "81065349"
uid: 55d4bd3d-a7port0-4b1d-8cfb-6d128236074c
spec:
selector:
istio: ingressgateway
servers:
- hosts:
- back-grpc-web.application.server
port:
name: http
number: port0
protocol: HTTP2
- hosts:
- back-grpc-web.application.server
port:
name: https
number: port3
protocol: HTTPS
tls:
credentialName: namespace-name-back-grpc-web
mode: SIMPLE
About this issue
- State: closed
- Created 9 months ago
- Comments: 24 (11 by maintainers)
Let's wait and see if there are any new findings.
There are no logs on the server side. We added more verbosity to the gRPC server and are waiting for the bug to happen again; I will send the detailed logs when it does.
For information, here is the gRPC keepalive documentation: https://grpc.io/docs/guides/keepalive/ We did not change the default settings for these parameters, except in one of our services: ("grpc.keepalive_time_ms", 60000).
But the issue also happens in services other than that one.
We may have changed some parameters on the front HAProxy to avoid timeouts, since we have long PDF generation streaming requests that can last more than 3 minutes.
The issue also happens on small unary calls.
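Regarding the long PDF generation streams mentioned above: if a route-level timeout ends up being a suspect, a rough sketch of what we could try is an explicit timeout on the gRPC-Web route (the value is illustrative, not something we have deployed):
spec:
  http:
  - match:
    - uri:
        prefix: /
    route:
    - destination:
        host: namespace-name-back-grpc-web
        port:
          number: 80
    timeout: 300s   # illustrative: above our longest PDF stream (> 3 minutes)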