gpdb: UDP interconnect packet lost when send EOS cause "ERROR: interconnect encountered a network error"
Bug Report
We encounters this bug when Greenplum cluster is huge and network is busy, so it‘s a bit hard to reproduce the behavior. When the problem happens, we debug the sender slice on a certain segment, and find it’s stack stuck in SendEosUDPIFC(), waiting for acks from receivers, and finally will report ERROR after timeout. ERROR message is : “Failed to send packet (seq 1) to ip:50505 (pid 2126628 cid 7) after 3566 retries in 3600 seconds”; When the sender slice is waiting for acks from receivers, the receiver slice had finished it’s work and states turned to ‘idle’.
I think this issue is caused by different UDP send behaviors:
- In sendOnce(), we will check sendto() return value and retry send if necessary, see ic_udpifc.c:4552 :
xmit_retry:
n = sendto(pEntry->txfd, buf->pkt, buf->pkt->len, 0,
(struct sockaddr *) &conn->peer, conn->peer_len);
if (n < 0)
{
if (errno == EINTR)
goto xmit_retry;
if (errno == EAGAIN) /* no space ? not an error. */
return;
/* ... */
}
- In sendControlMessage(), we will not handle sendto() failure, see ic_udpifc.c:1778 :
static inline void
sendControlMessage(icpkthdr *pkt, int fd, struct sockaddr *addr, socklen_t peerLen)
{
int n;
#ifdef USE_ASSERT_CHECKING
if (testmode_inject_fault(gp_udpic_dropacks_percent))
{
#ifdef AMS_VERBOSE_LOGGING
write_log("THROW CONTROL MESSAGE with seq %d extraSeq %d srcpid %d despid %d", pkt->seq, pkt->extraSeq, pkt->srcPid, pkt->dstPid);
#endif
return;
}
#endif
/* Add CRC for the control message. */
if (gp_interconnect_full_crc)
addCRC(pkt);
n = sendto(fd, (const char *) pkt, pkt->len, 0, addr, peerLen);
/*
* No need to handle EAGAIN here: no-space just means that we dropped the
* packet: our ordinary retransmit mechanism will handle that case
*/
if (n < pkt->len)
write_log("sendcontrolmessage: got error %d errno %d seq %d", n, errno, pkt->seq);
}
Receiver slice receive sender slice’s EOS (call sendOnce() ), and send ACK (call sendControlMessage() ) back to sender slice without sendto() check. When the network is not good, it could leads to the problem that receiver will definitely receive EOS and quit Motion, but send slice cannot receive ACK , endless pollAcks and cannot quit Motion.
So, why sendControlMessage() do not check sendto() return and retry? Can we avoid this bug?
Greenplum version or build
6x_stable
OS version and uname -a
CentOS 6
autoconf options used ( config.status --config )
Installation information ( pg_config )
Expected behavior
Actual behavior
Step to reproduce the behavior
About this issue
- Original URL
- State: open
- Created 3 years ago
- Comments: 34 (19 by maintainers)
@w517424787 I got it. (but looks my reply mail failed to send to your email addr)
And here is a new written wiki page to introduce how to fix it: https://github.com/greenplum-db/gpdb/wiki/How-to-deal-with-the-error-"Failed-to-send-packet"-of-"interconnect-encountered-a-network-error,-please-check-your-network"
@wuyuhao28 As you metioned,
sendControlMessage()is sent without retry, that is because in normal case, UDPIFC’s retransmit mechanism will handle the normal data packet loss issue, and will retransmit UDP packets with another ACK sent, so there is no need to retry insendControlMessage(), as the comments pointed out. But for some special case, if the ACK is loss, it will not retransmit with UDP packets, e.g. in the EOS ACK message case. Will check further.Summary this issue again for better understand: In our IC-UDP design, ACK doesn’t need to guarantee 100% reach to the sender, the sender will resend data again if doesn’t receive ACK. It works normally, but there is an exception on the EOS package (a special data).
Consider the below scenario:
Key points:
ic_control_info.connHtab(already deleted in star2), https://github.com/greenplum-db/gpdb/blob/a8c0140a96e297f01e86384ccd1c296974580fe9/src/backend/cdb/motion/ic_udpifc.c#L6310-L6319Improvement ideas:
ic_control_info.connHtabor not.Welcome to discuss, thanks.
correct what I wrote before, actually, current code already has a similar logic to “always response ACK for the EOS” (even if the motion is teardown):
handleMismatch()https://github.com/greenplum-db/gpdb/blob/d11fb106e8ea732b0fd72f938e18ce142476dba3/src/backend/cdb/motion/ic_udpifc.c#L6392-L6404 And I have verified it works in 5x, 6x and 7x.So why is this issue still happening? All I can think of so far is:
Considering both of them are low possibility, so no plan to fix it now and wait for more repro scenarios to understand it better.
The key point of this issue:
I have a simple idea to improve it: When receiver quits motion, its IC thread still can receive the EOS packages which sender re-sent, but the IC thread just skip it: https://github.com/greenplum-db/gpdb/blob/a8c0140a96e297f01e86384ccd1c296974580fe9/src/backend/cdb/motion/ic_udpifc.c#L6310-L6319 Because the relate
connhas been deleted fromic_control_info.connHtab(inTeardownUDPIFC()) .So we can change this behavior to always giving ACK for the EOS: considering no ACK for EOS will cause hang in motion sender, it makes sense (but need to think if have corner cases). In the worst cases (like network isolation), the hang still exists until reachs
Gp_interconnect_transmit_timeout, the default value is 1 hour, I think a small value (e.g. 10min) is more reasonable.Welcome to discuss, thanks.
I think it is hard since the UDP protocol can only provide a best-effort delivery. Maybe we can tune with these two parameters:
Gp_interconnect_transmit_timeoutandGp_interconnect_min_retries_before_timeout.Didn’t read too much code about it, just some thoughts: I think even if we do some retries in
sendControlMessage(), it still not 100% prevent the similar issue: the ACK package is no guarantee to be sent to receiver (the UDP package maybe missing, need another ACK for this ACK? looks infinite loop…). Seems we need a wise timeout mechanism here.