gpdb: UDP interconnect packet lost when sending EOS causes "ERROR: interconnect encountered a network error"

Bug Report

We encounter this bug when the Greenplum cluster is large and the network is busy, so it is a bit hard to reproduce. When the problem happens, we debug the sender slice on a certain segment and find its stack stuck in SendEosUDPIFC(), waiting for ACKs from receivers; it finally reports an ERROR after the timeout. The ERROR message is: “Failed to send packet (seq 1) to ip:50505 (pid 2126628 cid 7) after 3566 retries in 3600 seconds”. While the sender slice is waiting for ACKs from the receivers, the receiver slice has already finished its work and its state has turned to ‘idle’.

I think this issue is caused by different UDP send behaviors:

In sendOnce(), we check the sendto() return value and retry the send if necessary, see ic_udpifc.c:4552:

xmit_retry:
	n = sendto(pEntry->txfd, buf->pkt, buf->pkt->len, 0,
			   (struct sockaddr *) &conn->peer, conn->peer_len);
	if (n < 0)
	{
		if (errno == EINTR)
			goto xmit_retry;

		if (errno == EAGAIN)	/* no space ? not an error. */
			return;

		/* ... */
	}

In sendControlMessage(), we do not handle sendto() failure, see ic_udpifc.c:1778:

static inline void
sendControlMessage(icpkthdr *pkt, int fd, struct sockaddr *addr, socklen_t peerLen)
{
	int			n;

#ifdef USE_ASSERT_CHECKING
	if (testmode_inject_fault(gp_udpic_dropacks_percent))
	{
#ifdef AMS_VERBOSE_LOGGING
		write_log("THROW CONTROL MESSAGE with seq %d extraSeq %d srcpid %d despid %d", pkt->seq, pkt->extraSeq, pkt->srcPid, pkt->dstPid);
#endif
		return;
	}
#endif

	/* Add CRC for the control message. */
	if (gp_interconnect_full_crc)
		addCRC(pkt);

	n = sendto(fd, (const char *) pkt, pkt->len, 0, addr, peerLen);

	/*
	 * No need to handle EAGAIN here: no-space just means that we dropped the
	 * packet: our ordinary retransmit mechanism will handle that case
	 */

	if (n < pkt->len)
		write_log("sendcontrolmessage: got error %d errno %d seq %d", n, errno, pkt->seq);
}

The receiver slice receives the sender slice’s EOS (sent via sendOnce()) and sends the ACK back to the sender slice (via sendControlMessage()) without checking the sendto() result. When the network is bad, this can lead to a situation where the receiver has received the EOS and quit the Motion, but the sender slice never receives the ACK, polls for ACKs endlessly, and cannot quit the Motion.
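To make the difference between the two send paths concrete, here is a minimal sketch of a control-message send that checks the sendto() result the way sendOnce() does. The helper send_control_checked() and its result codes are hypothetical and are not part of ic_udpifc.c. It only covers local send failures (EINTR, EAGAIN); it cannot detect a datagram that is lost on the wire after the kernel accepts it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical result codes; not part of the Greenplum source. */
typedef enum
{
	CTRL_SENT,		/* whole datagram was handed to the kernel */
	CTRL_DROPPED,	/* EAGAIN/EWOULDBLOCK: no buffer space, caller may retry later */
	CTRL_FAILED		/* any other error, or a short send */
} ctrl_send_result;

/*
 * Send one control datagram, retrying only when the call was interrupted
 * by a signal, and tell the caller what happened instead of only logging.
 */
static ctrl_send_result
send_control_checked(int fd, const void *pkt, size_t len,
					 const struct sockaddr *addr, socklen_t addr_len)
{
	ssize_t		n;

	assert(pkt != NULL && len > 0);

	do
	{
		n = sendto(fd, pkt, len, 0, addr, addr_len);
	} while (n < 0 && errno == EINTR);

	if (n < 0)
	{
		if (errno == EAGAIN || errno == EWOULDBLOCK)
			return CTRL_DROPPED;
		return CTRL_FAILED;
	}

	return ((size_t) n == len) ? CTRL_SENT : CTRL_FAILED;
}
```

A caller could use CTRL_DROPPED to schedule one more attempt for an EOS ACK specifically, while still treating ordinary ACKs as best-effort.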

So why does sendControlMessage() not check the sendto() return value and retry? Can we avoid this bug?

Greenplum version or build

6x_stable

OS version and uname -a

CentOS 6

autoconf options used ( config.status --config )

Installation information ( pg_config )

Expected behavior

Actual behavior

Step to reproduce the behavior

About this issue

State: open
Created 3 years ago
Comments: 34 (19 by maintainers)

Most upvoted comments

@w517424787 I got it (but it looks like my reply failed to reach your email address).

And here is a newly written wiki page that explains how to fix it: https://github.com/greenplum-db/gpdb/wiki/How-to-deal-with-the-error-"Failed-to-send-packet"-of-"interconnect-encountered-a-network-error,-please-check-your-network"

interma on Dec 14, 2023

@wuyuhao28 As you mentioned, sendControlMessage() sends without retry. That is because in the normal case UDPIFC’s retransmit mechanism handles the loss of ordinary data packets: the UDP packets are retransmitted and another ACK is sent, so there is no need to retry in sendControlMessage(), as the comment points out. But in some special cases, if the ACK is lost, it is not re-sent along with retransmitted UDP packets, e.g. in the EOS ACK case. Will check further.

Aegeaner on Jan 4, 2022

To summarize this issue again for better understanding: in our IC-UDP design, an ACK is not guaranteed to reach the sender; the sender resends the data if it does not receive an ACK. This normally works, but there is an exception for the EOS packet (a special data packet).

Consider the scenario below: [image: WechatIMG109]

Key points:

star1: the ACK is lost due to a bad network.
star2: the MotionReceiver finishes all its work and deletes the corresponding connection with QE1, https://github.com/greenplum-db/gpdb/blob/8a6dd8fba77eaf8dfc4fbca7df9e026d7090a230/src/backend/cdb/motion/ic_udpifc.c#L3288
star3: the resent EOS can still reach the IC process of QE2, but it is dropped because there is no corresponding connection in ic_control_info.connHtab (already deleted in star2), https://github.com/greenplum-db/gpdb/blob/a8c0140a96e297f01e86384ccd1c296974580fe9/src/backend/cdb/motion/ic_udpifc.c#L6310-L6319
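The three key points can be replayed with a toy model. This is only a sketch of the scenario as described here; toy_receiver, toy_receive_eos() and toy_send_eos() are hypothetical names that mirror the roles of the receiver, its connection table entry, and the sender's resend loop, and they do not correspond to functions in ic_udpifc.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy receiver; the flag stands in for an entry in ic_control_info.connHtab. */
typedef struct
{
	bool		conn_registered;
} toy_receiver;

/* Receiver side: returns true if an ACK is put on the wire for this EOS. */
static bool
toy_receive_eos(toy_receiver *rx)
{
	if (!rx->conn_registered)
		return false;				/* star3: no matching connection, EOS dropped */

	rx->conn_registered = false;	/* star2: motion finishes, connection deleted */
	return true;
}

/*
 * Sender side: resend EOS until an ACK arrives or max_retries is reached.
 * ack_lost_on_first_try models star1. Returns the number of sends used,
 * or -1 if the sender gave up without ever seeing an ACK.
 */
static int
toy_send_eos(toy_receiver *rx, bool ack_lost_on_first_try, int max_retries)
{
	for (int attempt = 1; attempt <= max_retries; attempt++)
	{
		bool		acked = toy_receive_eos(rx);

		if (attempt == 1 && ack_lost_on_first_try)
			acked = false;			/* star1: the ACK never reaches the sender */

		if (acked)
			return attempt;
	}
	return -1;
}
```

With a healthy network the first EOS is acknowledged. If only the first ACK is lost, every later EOS is dropped by the receiver and the sender runs out of retries, which matches the reported error.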

Improvement ideas:

(First, nothing can solve the issue 100%, because UDP packets can be lost.)
As I mentioned before, always respond with an ACK for the EOS, regardless of whether the connection is in ic_control_info.connHtab.
New timeout mechanisms, e.g. a short transmit timeout for EOS (but this shifts the hang risk to the receiver).
No change, just provide workarounds, such as setting the interconnect type to TCP.

Welcome to discuss, thanks.

To correct what I wrote before: the current code already has logic similar to “always respond with an ACK for the EOS” (even if the motion has been torn down): handleMismatch() https://github.com/greenplum-db/gpdb/blob/d11fb106e8ea732b0fd72f938e18ce142476dba3/src/backend/cdb/motion/ic_udpifc.c#L6392-L6404 And I have verified that it works in 5x, 6x and 7x.

So why is this issue still happening? All I can think of so far is:

the MR process has quit (verified by killing it; the query then hangs here)
or the network is totally broken

Since both are unlikely, there is no plan to fix it now; we will wait for more repro scenarios to understand it better.

interma on Feb 21, 2023

The key point of this issue:

When the network is not good, it could lead to the problem that receiver will definitely receive EOS and quit Motion, but send slice cannot receive ACK, endless pollAcks and cannot quit Motion.

I have a simple idea to improve it: when the receiver quits the motion, its IC thread can still receive the EOS packets that the sender re-sends, but the IC thread just skips them: https://github.com/greenplum-db/gpdb/blob/a8c0140a96e297f01e86384ccd1c296974580fe9/src/backend/cdb/motion/ic_udpifc.c#L6310-L6319 This is because the related conn has been deleted from ic_control_info.connHtab (in TeardownUDPIFC()).

So we can change this behavior to always ACK the EOS: since a missing ACK for the EOS causes a hang in the motion sender, this makes sense (but we need to think about whether there are corner cases). In the worst case (such as network isolation), the hang still lasts until it reaches Gp_interconnect_transmit_timeout, whose default value is 1 hour; I think a smaller value (e.g. 10 min) is more reasonable.
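The proposed behavior change can be written down as a small decision function. This is a sketch of the idea only: toy_handle_packet() and its flags are hypothetical, and the real lookup and reply code in ic_udpifc.c is more involved.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
	REPLY_NONE,					/* packet is silently skipped */
	REPLY_ACK					/* an ACK is sent back to the sender */
} toy_reply;

/*
 * conn_found:     the packet matched an entry in the connection table
 * is_eos:         the packet is an EOS packet
 * always_ack_eos: the proposed policy switch (hypothetical)
 */
static toy_reply
toy_handle_packet(bool conn_found, bool is_eos, bool always_ack_eos)
{
	if (conn_found)
		return REPLY_ACK;		/* normal path: known connection */

	if (is_eos && always_ack_eos)
		return REPLY_ACK;		/* proposed: ACK EOS even after teardown */

	return REPLY_NONE;			/* behavior described above: skip the packet */
}
```

Restricting the extra ACK to EOS packets keeps the change narrow: ordinary data packets for a deleted connection are still skipped.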

Welcome to discuss, thanks.

interma on Nov 29, 2022

Since we cannot prevent ACK packets from being lost, can we add a mechanism that lets the sender motion check whether the receiver motion has quit?

I think it is hard since the UDP protocol can only provide best-effort delivery. Maybe we can tune these two parameters: Gp_interconnect_transmit_timeout and Gp_interconnect_min_retries_before_timeout.
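How the two parameters could interact is sketched below, assuming the sender gives up only once both the retry count and the elapsed time exceed their limits. That rule is an assumption made for illustration, and toy_ic_limits and toy_should_give_up() are hypothetical names; check ic_udpifc.c for the actual condition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the two settings named above. */
typedef struct
{
	int			transmit_timeout_sec;		/* modeled on Gp_interconnect_transmit_timeout */
	int			min_retries_before_timeout;	/* modeled on Gp_interconnect_min_retries_before_timeout */
} toy_ic_limits;

/*
 * Assumed rule: report the error only when the packet has been retried more
 * than the minimum number of times AND the timeout has elapsed.
 */
static bool
toy_should_give_up(const toy_ic_limits *lim, unsigned retries, uint64_t elapsed_us)
{
	uint64_t	timeout_us = (uint64_t) lim->transmit_timeout_sec * 1000000u;

	return retries > (unsigned) lim->min_retries_before_timeout &&
		elapsed_us > timeout_us;
}
```

Under this assumption, lowering the timeout from one hour to ten minutes makes a stuck EOS fail sooner, while the minimum retry count still protects short stalls from being reported as errors.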

Aegeaner on Jan 4, 2022

I haven’t read much of the code around this, just some thoughts: even if we add retries in sendControlMessage(), it still cannot 100% prevent similar issues: the ACK packet is not guaranteed to be delivered to its recipient (the UDP packet may go missing; would we need another ACK for this ACK? That looks like an infinite loop…). It seems we need a wise timeout mechanism here.

interma on Jan 4, 2022