grpc: [BUG] WSAEFAULT
Problem
Brieftly to say, the erorr is overlapped WSARecv call in gRPC that fails with a WSAEFAULT error
.
What version of gRPC and what language are you using?
1.37.1
What operating system (Linux, Windows,…) and version?
windows 10
What runtime / compiler are you using (e.g. python version or version of gcc)
Python3.9 and VS2019 compiler
What did you do?
Run an sample ping-pong program with server side written by c++ and client side written written by csharp. let it run whole night.
What did you expect to see?
no errors
What did you see instead?
WSAEFAULT
I suspect if this line of code causes the bug https://github.com/grpc/grpc/blob/ea389c00c2c83e0364c5b1cfb73820c0101daa14/src/core/lib/iomgr/tcp_windows.cc#L290 according to the msdn api doc,
WSAEWOULDBLOCK | Windows NT: Overlapped sockets: there are too many outstanding overlapped I/O requests. Nonoverlapped sockets: The socket is marked as nonblocking and the receive operation cannot be completed immediately.
is it handled correctly when the error is other than WSAEWOULDBLOCK
? so I am wondering if the right check here is if (info->wsa_error == 0)
Also why not continuing reading more data when the received bytes == the recv buffer size in case there are more data in the socket buffer. there shoudl be a while loop for non-overlapped wsarecv
until it got ewouldblock
error.
I also suspect if this line of code causes the bug. https://github.com/grpc/grpc/blob/ea389c00c2c83e0364c5b1cfb73820c0101daa14/src/core/lib/iomgr/tcp_windows.cc#L299 according to msdn api doc,
lpNumberOfBytesRecvd A pointer to the number, in bytes, of data received by this call if the receive operation completes immediately. Use NULL for this parameter if the lpOverlapped parameter is not NULL to avoid potentially erroneous results. This parameter can be NULL only if the lpOverlapped parameter is not NULL.
So I am wondering if &bytes_read
shoudl be changed to NULL
.
Also I am wondering how this piece of code works ?
https://github.com/grpc/grpc/blob/ea389c00c2c83e0364c5b1cfb73820c0101daa14/src/core/lib/iomgr/tcp_windows.cc#L192
why it is checking tcp->shutting_down instead of checking the retunred error code from wsarecv
to detect the local/remote socket shutdown ?
Also why it does not apply the locker when read the value of tcp->shutting_down
? I can see it is protected with locker when set to true. https://github.com/grpc/grpc/blob/ea389c00c2c83e0364c5b1cfb73820c0101daa14/src/core/lib/iomgr/tcp_windows.cc#L456
Also why not using the erro code from the next time wsarecv
or wsasend
to detect the local/remote socket abort/shutdown. The use of tcp-shutting_down
looks a bit doggy.
Also there is inconsistency here: https://github.com/grpc/grpc/blob/ea389c00c2c83e0364c5b1cfb73820c0101daa14/src/core/lib/iomgr/tcp_windows.cc#L292 https://github.com/grpc/grpc/blob/ea389c00c2c83e0364c5b1cfb73820c0101daa14/src/core/lib/iomgr/tcp_windows.cc#L311
finally I am wondering if it is correct to mix use of overlapped and non-overlapped wsarecv
api ?
when you created the socket, I can see it is overlapped socket. but why the non-overlapped wsarecv
is used here ?
https://github.com/grpc/grpc/blob/fd3bd70939fb4239639fbd26143ec416366e4157/src/core/lib/iomgr/socket_windows.cc#L188
Proposed Solution:
The way to operate with winsock in grpc looks a little bit doggy. the winsock apis are quite hard to make it right.
Please refer to the libuv as an example about how to operate winsock: either use overlapped or non-overlapped wsarecv
.
overlapped wsarecv
: https://github.com/libuv/libuv/blob/9918a1743816dc49d6c664e41641d78ffd4a0705/src/win/tcp.c#L513
non-overlapped wsarecv
: https://github.com/libuv/libuv/blob/9918a1743816dc49d6c664e41641d78ffd4a0705/src/win/tcp.c#L1068
I will recompile the grpc with the proposed changes later and then let us see how it goes.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 17 (9 by maintainers)
yes, we already discussed this (https://github.com/grpc/grpc/pull/28432#discussion_r1027319070), and you rightfully corrected me on my misunderstanding of the API. So you should ignore my remark on the stack allocation of buffers[MAX_WSABUF_COUNT]