grpc: [BUG] WSAEFAULT

Problem

Brieftly to say, the erorr is overlapped WSARecv call in gRPC that fails with a WSAEFAULT error.

What version of gRPC and what language are you using?

1.37.1

What operating system (Linux, Windows,…) and version?

windows 10

What runtime / compiler are you using (e.g. python version or version of gcc)

Python3.9 and VS2019 compiler

What did you do?

Run an sample ping-pong program with server side written by c++ and client side written written by csharp. let it run whole night.

What did you expect to see?

no errors

What did you see instead?

WSAEFAULT

I suspect if this line of code causes the bug https://github.com/grpc/grpc/blob/ea389c00c2c83e0364c5b1cfb73820c0101daa14/src/core/lib/iomgr/tcp_windows.cc#L290 according to the msdn api doc,

WSAEWOULDBLOCK | Windows NT:  Overlapped sockets: there are too many outstanding overlapped I/O requests. Nonoverlapped sockets: The socket is marked as nonblocking and the receive operation cannot be completed immediately.

is it handled correctly when the error is other than WSAEWOULDBLOCK ? so I am wondering if the right check here is if (info->wsa_error == 0) Also why not continuing reading more data when the received bytes == the recv buffer size in case there are more data in the socket buffer. there shoudl be a while loop for non-overlapped wsarecv until it got ewouldblock error.

I also suspect if this line of code causes the bug. https://github.com/grpc/grpc/blob/ea389c00c2c83e0364c5b1cfb73820c0101daa14/src/core/lib/iomgr/tcp_windows.cc#L299 according to msdn api doc,

lpNumberOfBytesRecvd A pointer to the number, in bytes, of data received by this call if the receive operation completes immediately. Use NULL for this parameter if the lpOverlapped parameter is not NULL to avoid potentially erroneous results. This parameter can be NULL only if the lpOverlapped parameter is not NULL.

So I am wondering if &bytes_read shoudl be changed to NULL.

Also I am wondering how this piece of code works ? https://github.com/grpc/grpc/blob/ea389c00c2c83e0364c5b1cfb73820c0101daa14/src/core/lib/iomgr/tcp_windows.cc#L192 why it is checking tcp->shutting_down instead of checking the retunred error code from wsarecv to detect the local/remote socket shutdown ?
Also why it does not apply the locker when read the value of tcp->shutting_down ? I can see it is protected with locker when set to true. https://github.com/grpc/grpc/blob/ea389c00c2c83e0364c5b1cfb73820c0101daa14/src/core/lib/iomgr/tcp_windows.cc#L456

Also why not using the erro code from the next time wsarecv or wsasend to detect the local/remote socket abort/shutdown. The use of tcp-shutting_down looks a bit doggy.

Also there is inconsistency here: https://github.com/grpc/grpc/blob/ea389c00c2c83e0364c5b1cfb73820c0101daa14/src/core/lib/iomgr/tcp_windows.cc#L292 https://github.com/grpc/grpc/blob/ea389c00c2c83e0364c5b1cfb73820c0101daa14/src/core/lib/iomgr/tcp_windows.cc#L311

finally I am wondering if it is correct to mix use of overlapped and non-overlapped wsarecv api ? when you created the socket, I can see it is overlapped socket. but why the non-overlapped wsarecv is used here ? https://github.com/grpc/grpc/blob/fd3bd70939fb4239639fbd26143ec416366e4157/src/core/lib/iomgr/socket_windows.cc#L188

Proposed Solution: The way to operate with winsock in grpc looks a little bit doggy. the winsock apis are quite hard to make it right. Please refer to the libuv as an example about how to operate winsock: either use overlapped or non-overlapped wsarecv. overlapped wsarecv: https://github.com/libuv/libuv/blob/9918a1743816dc49d6c664e41641d78ffd4a0705/src/win/tcp.c#L513 non-overlapped wsarecv: https://github.com/libuv/libuv/blob/9918a1743816dc49d6c664e41641d78ffd4a0705/src/win/tcp.c#L1068

I will recompile the grpc with the proposed changes later and then let us see how it goes.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

I am concerned the fact that the WSABUF buffers[MAX_WSABUF_COUNT] is allocated on the stack is problematic.

@Schramp I believe we discussed this on another issue, and it is fine per the Windows API spec.

yes, we already discussed this (https://github.com/grpc/grpc/pull/28432#discussion_r1027319070), and you rightfully corrected me on my misunderstanding of the API. So you should ignore my remark on the stack allocation of buffers[MAX_WSABUF_COUNT]