brpc: RDMA failure

Describe the bug

W0528 17:36:56.793694   510 external/brpc/src/brpc/input_messenger.cpp:240] Close Socket{id=1248 fd=1395 addr=10.156.8.12:24452:24141} (0x55c9fa744b40) due to unknown message: \F87\88>[\BE\C0<\00\00\00\80\CA|\E1=\F87\88>[\BE\C0<\CA\E0U\BE\CA|\E1=\F87\88>[\BE\C0<\00\00\00\80\CA|\E1=\F87\88>[\BE\C0<S}:=O\E9\82\BD...<skipping 16256 bytes>
W0528 17:36:56.794202   510 external/brpc/src/brpc/policy/baidu_rpc_protocol.cpp:265] Fail to write into Socket{id=1248 fd=1395 addr=10.156.8.12:24452:24141} (0x55c9fa744b40): Invalid argument

To Reproduce: On the master branch with RDMA enabled, the problem reliably reproduces within a few hours.

Expected behavior

Versions OS: Compiler: brpc: protobuf:

Additional context/screenshots

About this issue

  • State: open
  • Created a year ago
  • Comments: 73 (73 by maintainers)

Most upvoted comments

rdma_recv_zerocopy=false (disables receive-side zero copy), rdma_zerocopy_min_size=[a relatively large value] (disables send-side zero copy)

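Sorry, I just checked the code and the statement above is wrong. Setting rdma_zerocopy_min_size=[a relatively large value] still only disables receive-side zero copy. Send-side zero copy currently cannot be disabled through a flag. I will prepare a version that supports turning off send-side zero copy; please try again once it is available.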

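This kind of error may be caused by multiple threads accessing the same IOBuf at the same time. Could you disable zero copy first and check whether the problem still reproduces reliably? rdma_recv_zerocopy=false (disables receive-side zero copy), rdma_zerocopy_min_size=[a relatively large value] (disables send-side zero copy). The first thing to establish is whether the race is inside the RPC framework itself or between the RPC framework and the application on top of it.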
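
To illustrate why disabling zero copy is a useful way to narrow this down, here is a minimal sketch. It is not brpc code: `Block`, `ZeroCopyView`, `CopiedMessage`, `SendZeroCopy`, `SendWithCopy` and `OverwriteBlock` are hypothetical names made up for the example, and a plain `std::vector<char>` stands in for an IOBuf block. It is single-threaded and deterministic, so it only models the effect of a race (bytes changing underneath a message that still refers to them), not the race itself.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a reference-counted buffer block. NOT brpc's IOBuf.
using Block = std::vector<char>;

// A message that refers to the block without copying it (zero copy):
// it shares ownership of the same bytes as the original holder.
struct ZeroCopyView {
    std::shared_ptr<Block> block;
    std::string Read() const { return std::string(block->begin(), block->end()); }
};

// A message that owns a private copy of the bytes.
struct CopiedMessage {
    Block bytes;
    std::string Read() const { return std::string(bytes.begin(), bytes.end()); }
};

// Zero-copy hand-off: only the shared pointer is duplicated.
ZeroCopyView SendZeroCopy(const std::shared_ptr<Block>& b) {
    return ZeroCopyView{b};
}

// Copying hand-off: the bytes are duplicated at hand-off time.
CopiedMessage SendWithCopy(const std::shared_ptr<Block>& b) {
    return CopiedMessage{*b};
}

// Models another party reusing the block while a message still refers to it.
void OverwriteBlock(Block& b, char fill) {
    for (char& c : b) {
        c = fill;
    }
}
```

If the original holder overwrites the block after both hand-offs, the zero-copy message observes the new bytes while the copied message still holds the old ones. By the same logic, if the "unknown message" corruption disappears once zero copy is turned off, some party is probably touching shared buffer memory while the transport still references it. That alone does not say whether that party is inside the RPC framework or in the application.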