grpc: Parse failed with INTERNAL: Error parsing ':method' metadata randomly appears until reboot
What version of gRPC and what language are you using?
We are using Python, with gRPC versions ranging from 1.44 to 1.59. We are in a microservices architecture and only update dependencies when there is a security alert or when it is otherwise needed.
What operating system (Linux, Windows,…) and version?
We deploy all backend µservices with Docker, using this base image: python:3.8-slim-buster
We are on Kubernetes 1.27.4.
What runtime / compiler are you using (e.g. python version or version of gcc)
We are using Python 3.8 for all services.
What did you do?
For an unknown reason, gRPC starts failing to parse metadata with this error (the full message, with more debug info, is at the end of the issue):
E1012 08:08:15.270713685 57172 parsing.cc:823] INCOMING[0x7fbc20026760;0x7fbbdc02b590]: Parse failed with INTERNAL: Error parsing ':method' metadata [type.googleapis.com/grpc.status.int.stream_id='0']
This error appears randomly on our service, without any triggering action we could identify. Once it first appears on a pod, it then occurs on roughly 50% of calls until we restart the pod. Restarting the pod makes everything work fine again.
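For anyone who wants to capture the same level of logging, here is roughly how gRPC C-core tracing can be enabled from Python; the specific trace flags shown are an assumption on our side (they can also be set as environment variables on the container directly):

```python
import os

# These must be set before `grpc` is imported, because the C core reads them
# when it is first loaded.
os.environ["GRPC_VERBOSITY"] = "DEBUG"       # log level
os.environ["GRPC_TRACE"] = "http,tcp,api"    # trace flags; this particular set is an assumption

import grpc  # noqa: E402  -- imported after the env vars on purpose
```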
What did you expect to see?
Calls continue working without having to restart the pod.
What did you see instead?
The full gRPC log of a 503:
I1012 08:08:14.781017727 57172 tcp_posix.cc:670] TCP:0x7fbc181717b0 notify_on_read
I1012 08:08:14.781021935 57172 chttp2_transport.cc:2662] ipv4:127.0.0.6:44299: Complete BDP ping err=OK
I1012 08:08:14.935100866 24 completion_queue.cc:964] grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098095, tv_nsec: 135094073, clock_type: 1 }, reserved=(nil))
I1012 08:08:15.135808033 24 completion_queue.cc:964] grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098095, tv_nsec: 335802182, clock_type: 1 }, reserved=(nil))
I1012 08:08:15.270414693 57172 tcp_posix.cc:1095] TCP:0x7fbc181717b0 got_read: OK
I1012 08:08:15.270465018 57172 tcp_posix.cc:885] TCP:0x7fbc181717b0 do_read
I1012 08:08:15.270486278 57172 tcp_posix.cc:820] TCP:0x7fbc181717b0 call_cb 0x7fbc200268e8 0x7fbc6b4ee890:0x7fbc20026760
I1012 08:08:15.270500154 57172 tcp_posix.cc:822] READ 0x7fbc181717b0 (peer=ipv4:127.0.0.6:44299) error=OK
I1012 08:08:15.270520773 57172 parsing.cc:320] INCOMING[0x7fbc20026760]: HEADERS:END_HEADERS len:3521 id:0x0000b10b
I1012 08:08:15.270571088 57172 call.cc:1530] ops[0]: RECV_INITIAL_METADATA ptr=0x7fbbdc02ac48
I1012 08:08:15.270596996 57172 parsing.cc:652] [t:0x7fbc20026760 fd:13 peer:ipv4:127.0.0.6:44299] Accepting new stream
I1012 08:08:15.270612806 57172 parsing.cc:690] parsing initial_metadata
E1012 08:08:15.270713685 57172 parsing.cc:823] INCOMING[0x7fbc20026760;0x7fbbdc02b590]: Parse failed with INTERNAL: Error parsing ':method' metadata [type.googleapis.com/grpc.status.int.stream_id='0']
I1012 08:08:15.270744063 57172 chttp2_transport.cc:888] W:0x7fbc20026760 SERVER [ipv4:127.0.0.6:44299] state IDLE -> WRITING [RST_STREAM]
I1012 08:08:15.270788276 57172 parsing.cc:320] INCOMING[0x7fbc20026760]: DATA:END_STREAM len:327 id:0x0000b10b
I1012 08:08:15.270810668 57172 chttp2_transport.cc:2908] ipv4:127.0.0.6:44299: Keepalive ping cancelled. Resetting timer.
I1012 08:08:15.270829754 57172 tcp_posix.cc:670] TCP:0x7fbc181717b0 notify_on_read
I1012 08:08:15.270848339 57172 chttp2_transport.cc:1605] perform_stream_op[s=0x7fbbdc02b590; op=0x7fbbdc02c480]: RECV_INITIAL_METADATA
I1012 08:08:15.270863157 57172 chttp2_transport.cc:1343] perform_stream_op_locked[s=0x7fbbdc02b590; op=0x7fbbdc02c480]: RECV_INITIAL_METADATA; on_complete = (nil)
I1012 08:08:15.270891079 57172 completion_queue.cc:708] cq_end_op_for_next(cq=0x555dd064cf50, tag=0x7fbc542e0940, error=OK, done=0x7fbc6b7ad580, done_arg=0x7fbc2003e170, storage=0x7fbc2003e198)
I1012 08:08:15.270920635 57172 chttp2_transport.cc:888] W:0x7fbc20026760 SERVER [ipv4:127.0.0.6:44299] state WRITING -> WRITING [begin write in current thread]
I1012 08:08:15.270940703 57172 tcp_posix.cc:1818] WRITE 0x7fbc181717b0 (peer=ipv4:127.0.0.6:44299)
I1012 08:08:15.270953456 57172 tcp_posix.cc:1818] WRITE 0x7fbc181717b0 (peer=ipv4:127.0.0.6:44299)
I1012 08:08:15.270957244 24 completion_queue.cc:1076] RETURN_EVENT[0x555dd064cf50]: OP_COMPLETE: tag:0x7fbc542e0940 OK
I1012 08:08:15.271047433 24 metadata_array.cc:35] grpc_metadata_array_destroy(array=0x7fbc542e0970)
I1012 08:08:15.271168050 24 call.cc:3615] grpc_call_start_batch(call=0x7fbbdc029d30, ops=0x7fbc2001b5b0, nops=1, tag=0x7fbc070eb630, reserved=(nil))
I1012 08:08:15.271176546 24 call.cc:1530] ops[0]: RECV_CLOSE_ON_SERVER cancelled=0x7fbc06b13da8
I1012 08:08:15.271182507 24 chttp2_transport.cc:1605] perform_stream_op[s=0x7fbbdc02b590; op=0x7fbbdc02ceb0]: RECV_TRAILING_METADATA
I1012 08:08:15.271241388 57172 tcp_posix.cc:1866] write: OK
I1012 08:08:15.271389367 57172 chttp2_transport.cc:1343] perform_stream_op_locked[s=0x7fbbdc02b590; op=0x7fbbdc02ceb0]: RECV_TRAILING_METADATA; on_complete = (nil)
I1012 08:08:15.271435423 57172 completion_queue.cc:708] cq_end_op_for_next(cq=0x555dd064cf50, tag=0x7fbc070eb630, error=OK, done=0x7fbc6b783720, done_arg=0x7fbbdc02cea0, storage=0x7fbbdc02cef0)
I1012 08:08:15.271464618 57172 chttp2_transport.cc:888] W:0x7fbc20026760 SERVER [ipv4:127.0.0.6:44299] state WRITING -> IDLE [finish writing]
I1012 08:08:15.271249684 24 init.cc:150] grpc_init(void)
I1012 08:08:15.271561250 24 completion_queue.cc:1076] RETURN_EVENT[0x555dd064cf50]: OP_COMPLETE: tag:0x7fbc070eb630 OK
I1012 08:08:15.271595584 24 completion_queue.cc:964] grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098095, tv_nsec: 471594172, clock_type: 1 }, reserved=(nil))
I1012 08:08:15.271714098 26 call.cc:968] grpc_call_unref(c=0x7fbbdc029d30)
I1012 08:08:15.271737712 26 metadata_array.cc:35] grpc_metadata_array_destroy(array=0x7fbbdc02ac48)
I1012 08:08:15.271749233 26 init.cc:178] grpc_shutdown(void)
I1012 08:08:15.271753011 26 call_details.cc:36] grpc_call_details_destroy(details=0x7fbc14319700)
I1012 08:08:15.271756036 26 init.cc:178] grpc_shutdown(void)
I1012 08:08:15.471828299 24 completion_queue.cc:964] grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098095, tv_nsec: 671818340, clock_type: 1 }, reserved=(nil))
I1012 08:08:15.672849076 24 completion_queue.cc:964] grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098095, tv_nsec: 872839648, clock_type: 1 }, reserved=(nil))
I1012 08:08:15.873838455 24 completion_queue.cc:964] grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098096, tv_nsec: 73829578, clock_type: 1 }, reserved=(nil))
I1012 08:08:16.074815139 24 completion_queue.cc:964] grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098096, tv_nsec: 271592448, clock_type: 1 }, reserved=(nil))
I1012 08:08:16.271842002 24 completion_queue.cc:964] grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098096, tv_nsec: 471834007, clock_type: 1 }, reserved=(nil))
I1012 08:08:16.472829337 24 completion_queue.cc:964] grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098096, tv_nsec: 672820721, clock_type: 1 }, reserved=(nil))
I1012 08:08:16.673815288 24 completion_queue.cc:964] grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098096, tv_nsec: 873809006, clock_type: 1 }, reserved=(nil))
I1012 08:08:16.874815968 24 completion_queue.cc:964] grpc_completion_queue_next(cq=0x555dd064cf50, deadline=gpr_timespec { tv_sec: 1697098097, tv_nsec: 74810127, clock_type: 1 }, reserved=(nil))
I1012 08:08:17.075811878 24 completion_queue.cc:964] grpc_completion_queue_next(cq=0x555dd064cf50,
Istio log matching this gRPC log:
2023-10-12T08:08:15.270118Z debug envoy pool external/envoy/source/common/conn_pool/conn_pool_base.cc:265 [C107599] using existing fully connected connection thread=25
2023-10-12T08:08:15.270140Z debug envoy pool external/envoy/source/common/conn_pool/conn_pool_base.cc:182 [C107599] creating stream thread=25
2023-10-12T08:08:15.270167Z debug envoy router external/envoy/source/common/router/upstream_request.cc:581 [C243618][S8878698873862241484] pool ready thread=25
2023-10-12T08:08:15.270257Z debug envoy http external/envoy/source/common/http/conn_manager_impl.cc:955 [C243618][S8878698873862241484] request end stream thread=25
2023-10-12T08:08:15.270312Z debug envoy client external/envoy/source/common/http/codec_client.cc:139 [C107599] encode complete thread=25
2023-10-12T08:08:15.271027Z debug envoy http2 external/envoy/source/common/http/http2/codec_impl.cc:1363 [C107599] stream 45323 closed: 2 thread=25
2023-10-12T08:08:15.271052Z debug envoy client external/envoy/source/common/http/codec_client.cc:156 [C107599] request reset thread=25
2023-10-12T08:08:15.271083Z debug envoy pool external/envoy/source/common/conn_pool/conn_pool_base.cc:215 [C107599] destroying stream: 0 remaining thread=25
2023-10-12T08:08:15.271100Z debug envoy router external/envoy/source/common/router/router.cc:1208 [C243618][S8878698873862241484] upstream reset: reset reason: remote reset, transport failure reason: thread=25
2023-10-12T08:08:15.271137Z debug envoy http external/envoy/source/common/http/filter_manager.cc:967 [C243618][S8878698873862241484] Sending local reply with details upstream_reset_before_response_started{remote_reset} thread=25
2023-10-12T08:08:15.271216Z debug envoy http external/envoy/source/common/http/conn_manager_impl.cc:1596 [C243618][S8878698873862241484] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '0'
'content-type', 'application/grpc-web-text+proto'
'grpc-message', 'upstream connect error or disconnect/reset before headers. reset reason: remote reset'
'grpc-status', '14'
'x-envoy-peer-metadata', 'ChoKDkFQUF9DT05UQUlORVJTEggaBmRqYW5nbwoaCgpDTFVTVEVSX0lEEgwaCkt1YmVybmV0ZXMKHQoMSU5TVEFOQ0VfSVBTEg0aCzEwLjQyLjMzLjk2ChkKDUlTVElPX1ZFUlNJT04SCBoGMS4xNy41CoYDCgZMQUJFTFMS+wIq+AIKLQoDYXBwEiYaJHByb2R1Y3Rpb24tcmVmZXJlbmNpYWwtYmFjay1ncnBjLWFwaQopChthcHAua3ViZXJuZXRlcy5pby9jb21wb25lbnQSChoIZ3JwYy1hcGkKJgocYXBwLmt1YmVybmV0ZXMuaW8vbWFuYWdlZC1ieRIGGgRIZWxtCjcKFmFwcC5rdWJlcm5ldGVzLmlvL25hbWUSHRobcHJvZHVjdGlvbi1yZWZlcmVuY2lhbC1iYWNrCiIKDWhlbG0uc2gvY2hhcnQSERoPY2hhcnQtbmctMy44LjI0CiQKGXNlY3VyaXR5LmlzdGlvLmlvL3Rsc01vZGUSBxoFaXN0aW8KQAofc2VydmljZS5pc3Rpby5pby9jYW5vbmljYWwtbmFtZRIdGhtwcm9kdWN0aW9uLXJlZmVyZW5jaWFsLWJhY2sKLwojc2VydmljZS5pc3Rpby5pby9jYW5vbmljYWwtcmV2aXNpb24SCBoGbGF0ZXN0ChoKB01FU0hfSUQSDxoNY2x1c3Rlci5sb2NhbAo/CgROQU1FEjcaNXByb2R1Y3Rpb24tcmVmZXJlbmNpYWwtYmFjay1ncnBjLWFwaS03OGI0ZmQ1OGI0LTZrNWs2CioKCU5BTUVTUEFDRRIdGhtwcm9kdWN0aW9uLXJlZmVyZW5jaWFsLWJhY2sKfAoFT1dORVIScxpxa3ViZXJuZXRlczovL2FwaXMvYXBwcy92MS9uYW1lc3BhY2VzL3Byb2R1Y3Rpb24tcmVmZXJlbmNpYWwtYmFjay9kZXBsb3ltZW50cy9wcm9kdWN0aW9uLXJlZmVyZW5jaWFsLWJhY2stZ3JwYy1hcGkKFwoRUExBVEZPUk1fTUVUQURBVEESAioACjcKDVdPUktMT0FEX05BTUUSJhokcHJvZHVjdGlvbi1yZWZlcmVuY2lhbC1iYWNrLWdycGMtYXBp'
'x-envoy-peer-metadata-id', 'sidecar~10.42.33.96~namespace-name-back-grpc-api-78b4fd58b4-6k5k6.namespace-name-back~namespace-name-back.svc.cluster.local'
'date', 'Thu, 12 Oct 2023 08:08:14 GMT'
'server', 'istio-envoy'
thread=25
2023-10-12T08:08:15.271348Z debug envoy http2 external/envoy/source/common/http/http2/codec_impl.cc:1363 [C243618] stream 273 closed: 0 thread=25
2023-10-12T08:08:15.271487Z debug envoy http2 external/envoy/source/common/http/http2/codec_impl.cc:1416 [C243618] Recouping 0 bytes of flow control window for stream 273. thread=25
2023-10-12T08:08:15.271812Z debug envoy http2 external/envoy/source/common/http/http2/codec_impl.cc:1416 [C107599] Recouping 0 bytes of flow control window for stream 45323. thread=25
[2023-10-12T08:08:15.269Z] "POST /server.Controller/Update HTTP/2" 503 UR upstream_reset_before_response_started{remote_reset} - "-" 436 0 1 - "x.x.x.x,x.x.x.x,x.x.x.x" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2045.60" "a218c236-7f65-9fa8-a944-ffcafc791fca" "back-grpc-web.server" "x.x.x.x:port" inbound|port|| x.x.x.x:44299 10.42.33.96:port x.x.x.x:0 outbound_.80_._.server.svc.cluster.local default traceID=62fb4cc014e66ed6833346775b0bafdd
Usually it affects just 1 or 2 pods of the same microservice (we usually have between 3 and 8 pods depending on the microservice's load). An incident looks like this graph (with 100% meaning we just rebooted):
Today we had a bigger incident; this is what it looked like (we kept a failing pod around to inspect with ksniff and Wireshark):
Anything else we should know about your project / environment?
We have approximately 20 µservices that use grpcio (the Python lib). All of them can be affected, but we are far more impacted on the two oldest, which use the sync API rather than the async one.
We also have an open issue on Istio: https://github.com/istio/istio/issues/47037, as it is possible that Istio or Envoy modifies the request headers and gRPC simply fails because of it, but we would really appreciate a way to confirm that one way or another.
The keepalive settings have been a lead on this, but we didn't change the defaults except for one of our services.
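To be explicit about what "keepalive settings" means here, this is a sketch of the kind of server-side options involved; the values are illustrative, not what we actually run:

```python
from concurrent import futures

import grpc

# Illustrative keepalive-related channel args (example values, not our production settings).
keepalive_options = [
    ("grpc.keepalive_time_ms", 30_000),          # send a keepalive ping every 30 s
    ("grpc.keepalive_timeout_ms", 10_000),       # wait up to 10 s for the ping ack
    ("grpc.keepalive_permit_without_calls", 1),  # allow pings even with no active call
    ("grpc.http2.max_pings_without_data", 0),    # do not limit pings without data
]

server = grpc.server(futures.ThreadPoolExecutor(max_workers=10), options=keepalive_options)
```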
We are using both Python and gRPC-Web clients.
We didn't see any correlation between server load and the error happening, but we can't rule it out.
The issue happens roughly every 3 weeks, but that is not a reliable pattern to rely on, as it can lead to wrong interpretations; it happened twice in a single week this week.
We use metadata for data filtering, and some use cases can send a lot of filter data in the metadata, but we have adjusted the gRPC channel settings accordingly to handle it.
Except for the metadata and request/response message size limits, we didn't change any other gRPC channel settings.
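Concretely, those are the only options we touched; a sketch of the size-related channel args we mean (the values and address are placeholders, not our exact settings):

```python
import grpc

# Illustrative size-related channel args (example values only).
size_options = [
    ("grpc.max_metadata_size", 1 * 1024 * 1024),            # larger incoming metadata
    ("grpc.max_receive_message_length", 50 * 1024 * 1024),  # larger inbound messages
    ("grpc.max_send_message_length", 50 * 1024 * 1024),     # larger outbound messages
]

# The same list can be passed to grpc.server(...) on the server side,
# or to a client channel as below ("localhost:50051" is a placeholder).
channel = grpc.insecure_channel("localhost:50051", options=size_options)
```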
We tried to use ksniff and Wireshark but weren't able to capture incoming calls properly, only external ones. I attached the export containing both 200 and 503 responses just in case, but we weren't able to locate the 503 in the packet capture.
- file removed after closing the issue.
Any help on how to better debug this would be really appreciated, especially a way to display all the metadata before it is parsed.
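The closest thing we found on the Python side is a server interceptor that logs the method and invocation metadata of each call; a minimal sketch is below. As far as we understand, it only runs after the C core has already parsed the HTTP/2 headers, so it would probably never see the request that triggers the ':method' error, which is why we are asking for a lower-level option:

```python
import logging

import grpc


class MetadataLoggingInterceptor(grpc.ServerInterceptor):
    """Log the method and invocation metadata of every incoming call."""

    def intercept_service(self, continuation, handler_call_details):
        # handler_call_details exposes the already-parsed method and metadata,
        # i.e. only for requests that passed C-core header parsing.
        logging.info(
            "incoming call %s metadata=%s",
            handler_call_details.method,
            handler_call_details.invocation_metadata,
        )
        return continuation(handler_call_details)


# Usage sketch: pass the interceptor when building the server.
# server = grpc.server(executor, interceptors=[MetadataLoggingInterceptor()])
```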
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Comments: 21 (6 by maintainers)
The issue here is also causing the server to reject all subsequent requests once a first corrupted(?) request comes in.
@gnossen and @yashykt hi, we have been hitting a similar issue where the gRPC Python server fails with a similar error. Please take a look here. An ordinary HTTP request seems to make the gRPC server always reset the stream once it is received. Any ideas, or is there any plan to fix this?
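To illustrate what an "ordinary HTTP request" hitting the gRPC port could look like, here is a minimal sketch using the third-party h2 package that sends a prior-knowledge HTTP/2 request whose :method is GET instead of the POST that gRPC expects; the address and path are placeholders, and whether this reproduces the exact ':method' parse error is an assumption, not something we have confirmed:

```python
import socket

import h2.connection  # third-party package: `pip install h2`

# Placeholder address of a gRPC server; adjust to the service under test.
sock = socket.create_connection(("localhost", 50051))

conn = h2.connection.H2Connection()
conn.initiate_connection()
sock.sendall(conn.data_to_send())

# A request whose :method is GET rather than POST.
headers = [
    (":method", "GET"),
    (":path", "/server.Controller/Update"),   # placeholder path
    (":scheme", "http"),
    (":authority", "localhost"),
    ("content-type", "application/grpc"),
]
conn.send_headers(stream_id=1, headers=headers, end_stream=True)
sock.sendall(conn.data_to_send())

# Print whatever the server answers (we would expect an RST_STREAM or a
# trailers-only error response).
events = conn.receive_data(sock.recv(65535))
for event in events:
    print(event)
```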