grpc: Parse failed with INTERNAL: Error parsing ':method' metadata randomly appears until reboot

What version of gRPC and what language are you using?

We are using Python, with gRPC versions ranging from 1.44 to 1.59 across services. We have a microservices architecture and only update dependencies when there is a security alert or when otherwise needed.

What operating system (Linux, Windows,…) and version?

We deploy all backend microservices with Docker, using this base image: python:3.8-slim-buster.

We are on Kubernetes 1.27.4.

What runtime / compiler are you using (e.g. python version or version of gcc)

We are using Python 3.8 for all services.

What did you do?

For an unknown reason, gRPC starts failing to parse metadata with this error (the full message with more debug info is at the end of the issue):

E1012 08:08:15.270713685   57172 parsing.cc:823]                       INCOMING[0x7fbc20026760;0x7fbbdc02b590]: Parse failed with INTERNAL: Error parsing ':method' metadata [type.googleapis.com/grpc.status.int.stream_id='0']

This error appears randomly on our services, without any action on our side that we could identify as a trigger. Once it first appears on a pod, it occurs on approximately 50% of calls from then on until we restart the pod. Restarting the pod makes everything work fine again.

What did you expect to see?

Calls continuing to work without having to restart the pod.

What did you see instead?

Full gRPC debug log for a 503:

I1012 08:08:14.781017727   57172 tcp_posix.cc:670]                     TCP:0x7fbc181717b0 notify_on_read
I1012 08:08:14.781021935   57172 chttp2_transport.cc:2662]             ipv4:127.0.0.6:44299: Complete BDP ping err=OK
I1012 08:08:14.935100866      24 completion_queue.cc:964]              grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098095, tv_nsec: 135094073, clock_type: 1 }, reserved=(nil))
I1012 08:08:15.135808033      24 completion_queue.cc:964]              grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098095, tv_nsec: 335802182, clock_type: 1 }, reserved=(nil))
I1012 08:08:15.270414693   57172 tcp_posix.cc:1095]                    TCP:0x7fbc181717b0 got_read: OK
I1012 08:08:15.270465018   57172 tcp_posix.cc:885]                     TCP:0x7fbc181717b0 do_read
I1012 08:08:15.270486278   57172 tcp_posix.cc:820]                     TCP:0x7fbc181717b0 call_cb 0x7fbc200268e8 0x7fbc6b4ee890:0x7fbc20026760
I1012 08:08:15.270500154   57172 tcp_posix.cc:822]                     READ 0x7fbc181717b0 (peer=ipv4:127.0.0.6:44299) error=OK
I1012 08:08:15.270520773   57172 parsing.cc:320]                       INCOMING[0x7fbc20026760]: HEADERS:END_HEADERS len:3521 id:0x0000b10b
I1012 08:08:15.270571088   57172 call.cc:1530]                         ops[0]: RECV_INITIAL_METADATA ptr=0x7fbbdc02ac48
I1012 08:08:15.270596996   57172 parsing.cc:652]                       [t:0x7fbc20026760 fd:13 peer:ipv4:127.0.0.6:44299] Accepting new stream
I1012 08:08:15.270612806   57172 parsing.cc:690]                       parsing initial_metadata
E1012 08:08:15.270713685   57172 parsing.cc:823]                       INCOMING[0x7fbc20026760;0x7fbbdc02b590]: Parse failed with INTERNAL: Error parsing ':method' metadata [type.googleapis.com/grpc.status.int.stream_id='0']
I1012 08:08:15.270744063   57172 chttp2_transport.cc:888]              W:0x7fbc20026760 SERVER [ipv4:127.0.0.6:44299] state IDLE -> WRITING [RST_STREAM]
I1012 08:08:15.270788276   57172 parsing.cc:320]                       INCOMING[0x7fbc20026760]: DATA:END_STREAM len:327 id:0x0000b10b
I1012 08:08:15.270810668   57172 chttp2_transport.cc:2908]             ipv4:127.0.0.6:44299: Keepalive ping cancelled. Resetting timer.
I1012 08:08:15.270829754   57172 tcp_posix.cc:670]                     TCP:0x7fbc181717b0 notify_on_read
I1012 08:08:15.270848339   57172 chttp2_transport.cc:1605]             perform_stream_op[s=0x7fbbdc02b590; op=0x7fbbdc02c480]:  RECV_INITIAL_METADATA
I1012 08:08:15.270863157   57172 chttp2_transport.cc:1343]             perform_stream_op_locked[s=0x7fbbdc02b590; op=0x7fbbdc02c480]:  RECV_INITIAL_METADATA; on_complete = (nil)
I1012 08:08:15.270891079   57172 completion_queue.cc:708]              cq_end_op_for_next(cq=0x555dd064cf50, tag=0x7fbc542e0940, error=OK, done=0x7fbc6b7ad580, done_arg=0x7fbc2003e170, storage=0x7fbc2003e198)
I1012 08:08:15.270920635   57172 chttp2_transport.cc:888]              W:0x7fbc20026760 SERVER [ipv4:127.0.0.6:44299] state WRITING -> WRITING [begin write in current thread]
I1012 08:08:15.270940703   57172 tcp_posix.cc:1818]                    WRITE 0x7fbc181717b0 (peer=ipv4:127.0.0.6:44299)
I1012 08:08:15.270953456   57172 tcp_posix.cc:1818]                    WRITE 0x7fbc181717b0 (peer=ipv4:127.0.0.6:44299)
I1012 08:08:15.270957244      24 completion_queue.cc:1076]             RETURN_EVENT[0x555dd064cf50]: OP_COMPLETE: tag:0x7fbc542e0940 OK
I1012 08:08:15.271047433      24 metadata_array.cc:35]                 grpc_metadata_array_destroy(array=0x7fbc542e0970)
I1012 08:08:15.271168050      24 call.cc:3615]                         grpc_call_start_batch(call=0x7fbbdc029d30, ops=0x7fbc2001b5b0, nops=1, tag=0x7fbc070eb630, reserved=(nil))
I1012 08:08:15.271176546      24 call.cc:1530]                         ops[0]: RECV_CLOSE_ON_SERVER cancelled=0x7fbc06b13da8
I1012 08:08:15.271182507      24 chttp2_transport.cc:1605]             perform_stream_op[s=0x7fbbdc02b590; op=0x7fbbdc02ceb0]:  RECV_TRAILING_METADATA
I1012 08:08:15.271241388   57172 tcp_posix.cc:1866]                    write: OK
I1012 08:08:15.271389367   57172 chttp2_transport.cc:1343]             perform_stream_op_locked[s=0x7fbbdc02b590; op=0x7fbbdc02ceb0]:  RECV_TRAILING_METADATA; on_complete = (nil)
I1012 08:08:15.271435423   57172 completion_queue.cc:708]              cq_end_op_for_next(cq=0x555dd064cf50, tag=0x7fbc070eb630, error=OK, done=0x7fbc6b783720, done_arg=0x7fbbdc02cea0, storage=0x7fbbdc02cef0)
I1012 08:08:15.271464618   57172 chttp2_transport.cc:888]              W:0x7fbc20026760 SERVER [ipv4:127.0.0.6:44299] state WRITING -> IDLE [finish writing]
I1012 08:08:15.271249684      24 init.cc:150]                          grpc_init(void)
I1012 08:08:15.271561250      24 completion_queue.cc:1076]             RETURN_EVENT[0x555dd064cf50]: OP_COMPLETE: tag:0x7fbc070eb630 OK
I1012 08:08:15.271595584      24 completion_queue.cc:964]              grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098095, tv_nsec: 471594172, clock_type: 1 }, reserved=(nil))
I1012 08:08:15.271714098      26 call.cc:968]                          grpc_call_unref(c=0x7fbbdc029d30)
I1012 08:08:15.271737712      26 metadata_array.cc:35]                 grpc_metadata_array_destroy(array=0x7fbbdc02ac48)
I1012 08:08:15.271749233      26 init.cc:178]                          grpc_shutdown(void)
I1012 08:08:15.271753011      26 call_details.cc:36]                   grpc_call_details_destroy(details=0x7fbc14319700)
I1012 08:08:15.271756036      26 init.cc:178]                          grpc_shutdown(void)
I1012 08:08:15.471828299      24 completion_queue.cc:964]              grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098095, tv_nsec: 671818340, clock_type: 1 }, reserved=(nil))
I1012 08:08:15.672849076      24 completion_queue.cc:964]              grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098095, tv_nsec: 872839648, clock_type: 1 }, reserved=(nil))
I1012 08:08:15.873838455      24 completion_queue.cc:964]              grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098096, tv_nsec: 73829578, clock_type: 1 }, reserved=(nil))
I1012 08:08:16.074815139      24 completion_queue.cc:964]              grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098096, tv_nsec: 271592448, clock_type: 1 }, reserved=(nil))
I1012 08:08:16.271842002      24 completion_queue.cc:964]              grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098096, tv_nsec: 471834007, clock_type: 1 }, reserved=(nil))
I1012 08:08:16.472829337      24 completion_queue.cc:964]              grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098096, tv_nsec: 672820721, clock_type: 1 }, reserved=(nil))
I1012 08:08:16.673815288      24 completion_queue.cc:964]              grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098096, tv_nsec: 873809006, clock_type: 1 }, reserved=(nil))
I1012 08:08:16.874815968      24 completion_queue.cc:964]              grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098097, tv_nsec: 74810127, clock_type: 1 }, reserved=(nil))
I1012 08:08:17.075811878      24 completion_queue.cc:964]              grpc_completion_queue_next(cq=0x555dd064cf50, 

Istio log matching this gRPC log:

2023-10-12T08:08:15.270118Z debug envoy pool external/envoy/source/common/conn_pool/conn_pool_base.cc:265 [C107599] using existing fully connected connection thread=25
2023-10-12T08:08:15.270140Z debug envoy pool external/envoy/source/common/conn_pool/conn_pool_base.cc:182 [C107599] creating stream thread=25
2023-10-12T08:08:15.270167Z debug envoy router external/envoy/source/common/router/upstream_request.cc:581  [C243618][S8878698873862241484] pool ready  thread=25
2023-10-12T08:08:15.270257Z debug envoy http external/envoy/source/common/http/conn_manager_impl.cc:955 [C243618][S8878698873862241484] request end stream  thread=25
2023-10-12T08:08:15.270312Z debug envoy client external/envoy/source/common/http/codec_client.cc:139  [C107599] encode complete thread=25
2023-10-12T08:08:15.271027Z debug envoy http2 external/envoy/source/common/http/http2/codec_impl.cc:1363  [C107599] stream 45323 closed: 2  thread=25
2023-10-12T08:08:15.271052Z debug envoy client external/envoy/source/common/http/codec_client.cc:156  [C107599] request reset thread=25
2023-10-12T08:08:15.271083Z debug envoy pool external/envoy/source/common/conn_pool/conn_pool_base.cc:215 [C107599] destroying stream: 0 remaining  thread=25
2023-10-12T08:08:15.271100Z debug envoy router external/envoy/source/common/router/router.cc:1208 [C243618][S8878698873862241484] upstream reset: reset reason: remote reset, transport failure reason:   thread=25
2023-10-12T08:08:15.271137Z debug envoy http external/envoy/source/common/http/filter_manager.cc:967  [C243618][S8878698873862241484] Sending local reply with details upstream_reset_before_response_started{remote_reset} thread=25
2023-10-12T08:08:15.271216Z debug envoy http external/envoy/source/common/http/conn_manager_impl.cc:1596  [C243618][S8878698873862241484] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '0'
'content-type', 'application/grpc-web-text+proto'
'grpc-message', 'upstream connect error or disconnect/reset before headers. reset reason: remote reset'
'grpc-status', '14'
'x-envoy-peer-metadata', 'ChoKDkFQUF9DT05UQUlORVJTEggaBmRqYW5nbwoaCgpDTFVTVEVSX0lEEgwaCkt1YmVybmV0ZXMKHQoMSU5TVEFOQ0VfSVBTEg0aCzEwLjQyLjMzLjk2ChkKDUlTVElPX1ZFUlNJT04SCBoGMS4xNy41CoYDCgZMQUJFTFMS+wIq+AIKLQoDYXBwEiYaJHByb2R1Y3Rpb24tcmVmZXJlbmNpYWwtYmFjay1ncnBjLWFwaQopChthcHAua3ViZXJuZXRlcy5pby9jb21wb25lbnQSChoIZ3JwYy1hcGkKJgocYXBwLmt1YmVybmV0ZXMuaW8vbWFuYWdlZC1ieRIGGgRIZWxtCjcKFmFwcC5rdWJlcm5ldGVzLmlvL25hbWUSHRobcHJvZHVjdGlvbi1yZWZlcmVuY2lhbC1iYWNrCiIKDWhlbG0uc2gvY2hhcnQSERoPY2hhcnQtbmctMy44LjI0CiQKGXNlY3VyaXR5LmlzdGlvLmlvL3Rsc01vZGUSBxoFaXN0aW8KQAofc2VydmljZS5pc3Rpby5pby9jYW5vbmljYWwtbmFtZRIdGhtwcm9kdWN0aW9uLXJlZmVyZW5jaWFsLWJhY2sKLwojc2VydmljZS5pc3Rpby5pby9jYW5vbmljYWwtcmV2aXNpb24SCBoGbGF0ZXN0ChoKB01FU0hfSUQSDxoNY2x1c3Rlci5sb2NhbAo/CgROQU1FEjcaNXByb2R1Y3Rpb24tcmVmZXJlbmNpYWwtYmFjay1ncnBjLWFwaS03OGI0ZmQ1OGI0LTZrNWs2CioKCU5BTUVTUEFDRRIdGhtwcm9kdWN0aW9uLXJlZmVyZW5jaWFsLWJhY2sKfAoFT1dORVIScxpxa3ViZXJuZXRlczovL2FwaXMvYXBwcy92MS9uYW1lc3BhY2VzL3Byb2R1Y3Rpb24tcmVmZXJlbmNpYWwtYmFjay9kZXBsb3ltZW50cy9wcm9kdWN0aW9uLXJlZmVyZW5jaWFsLWJhY2stZ3JwYy1hcGkKFwoRUExBVEZPUk1fTUVUQURBVEESAioACjcKDVdPUktMT0FEX05BTUUSJhokcHJvZHVjdGlvbi1yZWZlcmVuY2lhbC1iYWNrLWdycGMtYXBp'
'x-envoy-peer-metadata-id', 'sidecar~10.42.33.96~namespace-name-back-grpc-api-78b4fd58b4-6k5k6.namespace-name-back~namespace-name-back.svc.cluster.local'
'date', 'Thu, 12 Oct 2023 08:08:14 GMT'
'server', 'istio-envoy'
  thread=25
2023-10-12T08:08:15.271348Z debug envoy http2 external/envoy/source/common/http/http2/codec_impl.cc:1363  [C243618] stream 273 closed: 0  thread=25
2023-10-12T08:08:15.271487Z debug envoy http2 external/envoy/source/common/http/http2/codec_impl.cc:1416  [C243618] Recouping 0 bytes of flow control window for stream 273.  thread=25
2023-10-12T08:08:15.271812Z debug envoy http2 external/envoy/source/common/http/http2/codec_impl.cc:1416  [C107599] Recouping 0 bytes of flow control window for stream 45323.  thread=25
[2023-10-12T08:08:15.269Z] "POST /server.Controller/Update HTTP/2" 503 UR upstream_reset_before_response_started{remote_reset} - "-" 436 0 1 - "x.x.x.x,x.x.x.x,x.x.x.x" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.60" "a218c236-7f65-9fa8-a944-ffcafc791fca" "back-grpc-web.server" "x.x.x.x:port" inbound|port|| x.x.x.x:44299 10.42.33.96:port x.x.x.x:0 outbound_.80_._.server.svc.cluster.local default traceID=62fb4cc014e66ed6833346775b0bafdd

Usually it affects only 1 or 2 pods of the same microservice (we usually have between 3 and 8 pods depending on the microservice's load). An incident looks like this graph (100% meaning we had just rebooted):

[image: graph of a typical incident]

Today we had a bigger incident; this is what it looked like (we kept one failing pod running so we could inspect it with ksniff and Wireshark):

[screenshot from 2023-10-18 17:42:18]

Anything else we should know about your project / environment?

We have approximately 20 microservices that use grpcio (the Python library). All of them can be affected, but we are far more impacted by the two oldest ones, which use the sync API rather than the async one.
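For clarity, by "sync" and "async" we mean the two grpcio server APIs; a minimal sketch of the difference is below (the servicer registration is left as a placeholder comment, since the real generated helpers are not relevant here):

import grpc
from grpc import aio
from concurrent import futures

# Sync API (the style used by our two oldest, most impacted services).
def serve_sync():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    # add_MyServicer_to_server(MyServicer(), server)  # placeholder for the generated helper
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()

# Asyncio API (the style used by the newer services).
async def serve_async():
    server = aio.server()
    # add_MyServicer_to_server(MyServicer(), server)  # placeholder for the generated helper
    server.add_insecure_port("[::]:50051")
    await server.start()
    await server.wait_for_termination()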

We also have an open issue on Istio: https://github.com/istio/istio/issues/47037, as it is possible that Istio or Envoy modifies the request headers and gRPC simply fails because of it, but we would really appreciate a way to confirm that one way or another.

The keepalive settings were one lead, but we did not change the defaults except for one of our services.
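For reference, keepalive on a Python gRPC server is tuned through channel arguments, roughly like the sketch below; the option names are the standard gRPC core arguments, and the values are illustrative only, not the ones we actually set on that one service:

import grpc
from concurrent import futures

# Standard gRPC core keepalive arguments; the values are examples only.
keepalive_options = [
    ("grpc.keepalive_time_ms", 30000),           # send a keepalive ping every 30 s
    ("grpc.keepalive_timeout_ms", 10000),        # wait 10 s for the ping ack
    ("grpc.keepalive_permit_without_calls", 1),  # allow pings while no call is active
    ("grpc.http2.max_pings_without_data", 0),    # do not limit pings sent without data frames
]

server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=10),
    options=keepalive_options,
)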

We are using both Python and grpc-web clients.

We did not see any correlation between server load and the error happening, but it is not excluded.

The issue happens approximately every 3 weeks, but that figure should not be relied on too much, as it can lead to wrong interpretations; it happened twice in a single week this week.

We use metadata for data filtering, and some use cases can send a lot of filter data in metadata, but we have properly adjusted the gRPC channel settings to handle it.

Except for the metadata and request/response message size limits, we did not change any gRPC channel settings.
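Concretely, those overrides look roughly like this sketch; the option names are the standard gRPC core arguments for metadata and message size limits, and the values here are examples, not our production ones:

import grpc
from concurrent import futures

size_options = [
    ("grpc.max_metadata_size", 1024 * 1024),                # allow large filter metadata (example: 1 MiB)
    ("grpc.max_receive_message_length", 64 * 1024 * 1024),  # example: 64 MiB inbound messages
    ("grpc.max_send_message_length", 64 * 1024 * 1024),     # example: 64 MiB outbound messages
]

server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=10),
    options=size_options,
)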

We tried to use ksniff and Wireshark but were not able to capture incoming calls properly, only external ones. I attached the export containing both 200 and 503 responses just in case, but we were not able to locate the 503 in the packet capture.

  • file removed after closing the issue.

Any help on how to better debug this would be really appreciated, especially a way to display all the metadata before parsing.
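For reference, a server interceptor can at least log the invocation metadata of requests that do reach the application, as in the sketch below; note that it only sees requests gRPC core managed to parse, so it would not show the failing ':method' case. The debug logs above come from gRPC core's own tracing (the GRPC_VERBOSITY / GRPC_TRACE environment variables) instead.

import logging
import grpc

class MetadataLoggingInterceptor(grpc.ServerInterceptor):
    """Log the invocation metadata of every request that reaches the application.

    Requests whose ':method' pseudo-header fails to parse are rejected inside
    gRPC core before interceptors run, so they will not show up here.
    """

    def intercept_service(self, continuation, handler_call_details):
        logging.info(
            "call %s metadata=%s",
            handler_call_details.method,
            handler_call_details.invocation_metadata,
        )
        return continuation(handler_call_details)

# Usage with the sync API:
# server = grpc.server(thread_pool, interceptors=[MetadataLoggingInterceptor()])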

About this issue

  • Original URL
  • State: closed
  • Created 8 months ago
  • Comments: 21 (6 by maintainers)

Most upvoted comments

The issue here is also causing the server to reject all subsequent requests once a first corrupted(?) request comes in.

@gnossen and @yashykt hi, we have been hitting a similar issue where the gRPC Python server fails with a similar error. Please take a look here. An ordinary HTTP request seems to make the gRPC server always reset the stream once it is received. Any ideas, or is there any plan to fix this?