tailscale: Slow throughput over tsnet compared to Windows tailscaled

What is the issue?

I have a Go HTTP server that serves up https://github.com/librespeed/speedtest.

I made this server accessible on my tailnet using both the system Tailscale (Windows 1.50) and embedded tsnet.

On a remote node across the country, where I am accessing the speed test, I confirmed via tailscale status that there is a direct connection to both IPs.

When I invoke the browser-based speed test against the system Tailscale IP, the speed stabilizes after a few seconds and in my case measures 35 Mbps. Hitting the same server on the same node via its tsnet IP, the speed never stabilizes, keeps dropping, and is near 4 Mbps at the end of the test.

cc @eric

Steps to reproduce

Full standalone repro to come.

Here are the gin-gonic routes being used to drive the speed test:

	r.GET("/garbage", func(c *gin.Context) {
		c.Header("Content-Description", "File Transfer")
		c.Header("Content-Type", "application/octet-stream")
		c.Header("Content-Disposition", "attachment; filename=random.dat")
		c.Header("Content-Transfer-Encoding", "binary")
		ckSize, _ := strconv.ParseInt(c.Query("ckSize"), 10, 64)
		if ckSize == 0 || ckSize > 2048 {
			ckSize = 4
		}
		data := make([]byte, 1024*1024)
		io.ReadAtLeast(rand.Reader, data, len(data))
		for ckSize > 0 {
			_, err := c.Writer.Write(data)
			if err != nil {
				break
			}
			ckSize--
		}
	})

	r.GET("/empty", func(c *gin.Context) {
		c.Status(200)
	})
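Until the standalone repro lands, the tsnet side of the setup is essentially the same router served from a tsnet listener; a minimal sketch (the hostname and port here are assumptions, not the exact repro):

	package main

	import (
		"log"

		"github.com/gin-gonic/gin"
		"tailscale.com/tsnet"
	)

	func main() {
		// Joins the tailnet as its own node, separate from the system Tailscale.
		srv := &tsnet.Server{Hostname: "speedtest"} // hostname is an assumption
		defer srv.Close()

		// Listen on the tsnet IP rather than the system Tailscale interface.
		ln, err := srv.Listen("tcp", ":80")
		if err != nil {
			log.Fatal(err)
		}

		r := gin.Default()
		// ... register the /garbage and /empty handlers shown above ...
		log.Fatal(r.RunListener(ln))
	}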

Are there any recent changes that introduced the issue?

Not sure, but I upgraded tsnet from 1.48 to main (which includes a gvisor bump) and did not observe any improvement.

I also ran the same test against a Linux and a macOS server and observed the same behavior, where the speed declines continuously over tsnet.

OS

Windows

OS version

Windows 11

Tailscale version

1.51.0-dev20231007

Other software

No response

Bug report

No response

About this issue

  • State: open
  • Created 9 months ago
  • Comments: 25 (16 by maintainers)


Most upvoted comments

Traffic is being generated by the tsnet app to send to the Mac. The Mac receives the frames and generates ACKs back to tsnet.

1/ Looking at the tsnet app sending packets: to the left of the 1.08-second timestamp, we see tsnet send a burst of four packets, followed a short time later by a burst of three packets.

[Figure: b9707 tsnet time-sequence graph]


2/ Looking at the Mac between 1.2 and 1.205 seconds:

[Figure: b9707 mac time-sequence graph]

The timestamps are relative to the start of each capture and are not synchronized between the Mac and tsnet systems, so don’t pay attention to 1.08 versus 1.2 seconds.

  • The Mac receives four packets, and generates an ACK which goes out almost immediately.
  • The Mac receives three more packets a short time later, and generates another ACK almost immediately.

3/ Looking at tsnet again, the ACKs sent by the Mac are not received until almost 1.086. Zooming in on that timestamp:

[Figure: b9707 tsnet time-sequence graph, zoomed in]
  • We see tsnet decide to retransmit all seven packets, the set of four and the set of three. It also sends several additional, fresh packets.
  • At that very instant it receives an ACK for seven frames.

These were clearly the ACKs for the four packets and the three packets, which the Mac sent some time ago in response to the first transmission of those packets. They can’t possibly have arrived in response to the retransmission an instant earlier; the ACKs must have been sitting in a queue waiting to be processed.

To me it looks like tsnet is only processing ACKs when it decides to transmit something, so it finally receives the ACK after already having decided to retransmit. The retransmission is unnecessary.
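If that reading is right, the failure mode is the classic coupling sketched below, where inbound ACKs are only drained when the transmit path wakes up. This toy model is purely illustrative and is not gvisor’s actual structure:

	package main

	import (
		"fmt"
		"time"
	)

	func main() {
		acks := make(chan int, 16)

		// The receiver ACKs promptly after getting the segments.
		go func() {
			time.Sleep(5 * time.Millisecond)
			acks <- 7 // one ACK covering all seven outstanding segments
		}()

		// The sender only drains the ACK queue when it wakes to transmit,
		// here driven by a retransmit timer rather than by ACK arrival.
		time.Sleep(20 * time.Millisecond)
		fmt.Println("retransmitting 7 segments (unnecessarily)")
		select {
		case n := <-acks:
			fmt.Printf("only now processing the ACK covering %d segments\n", n)
		default:
		}
	}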

There’s definitely some kind of bug in the flow with gvisor, and it is significantly worsened with SACK enabled. The results are also skewed when RACK is enabled, but disabling both does not result in satisfactory statistics on a local link.

I still see, for example, ~15k fast retransmits over a run with both RACK and SACK disabled. Disabling SACK does, however, improve throughput by 60-100%, depending on other luck; this effectively mitigates part of the problem, but it’s not a solution.
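For reference, SACK and RACK are per-stack knobs in gvisor; a sketch of toggling them, assuming gvisor’s tcpip option types (tsnet does not expose its stack.Stack publicly):

	package main

	import (
		"log"

		"gvisor.dev/gvisor/pkg/tcpip"
		"gvisor.dev/gvisor/pkg/tcpip/network/ipv4"
		"gvisor.dev/gvisor/pkg/tcpip/stack"
		"gvisor.dev/gvisor/pkg/tcpip/transport/tcp"
	)

	func main() {
		s := stack.New(stack.Options{
			NetworkProtocols:   []stack.NetworkProtocolFactory{ipv4.NewProtocol},
			TransportProtocols: []stack.TransportProtocolFactory{tcp.NewProtocol},
		})

		// Turn SACK off (Tailscale's netstack turns it on at stack creation).
		sackOff := tcpip.TCPSACKEnabled(false)
		if err := s.SetTransportProtocolOption(tcp.ProtocolNumber, &sackOff); err != nil {
			log.Fatalf("disabling SACK: %v", err)
		}

		// Assumption: the zero value of TCPRecovery selects no RACK-based
		// loss detection.
		var recovery tcpip.TCPRecovery
		if err := s.SetTransportProtocolOption(tcp.ProtocolNumber, &recovery); err != nil {
			log.Fatalf("disabling RACK: %v", err)
		}
	}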

I suspect this is also happening on Linux, but the impact may be lower.

I’ve not yet validated whether this occurs when running in privileged mode; I would certainly expect to have noticed it in tun mode. There is no platform-specific code on the gvisor side related to timers and retries, so intuition suggests this may be an issue with unprivileged userspace failing to set socket buffers and then bursting far too much traffic. What isn’t immediately obvious from that theory is why the window size doesn’t eventually get this under control, since retransmits continue throughout a test run. Plenty to look into still.
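The socket-buffer half of that theory is cheap to probe from unprivileged Go code; something along these lines (an experiment sketch, not code from the tree):

	package main

	import (
		"log"
		"net"
	)

	func main() {
		// Ask the OS for larger buffers on a UDP socket, as a stand-in for
		// the WireGuard socket; Windows may silently clamp these values.
		conn, err := net.ListenUDP("udp", &net.UDPAddr{})
		if err != nil {
			log.Fatal(err)
		}
		defer conn.Close()

		if err := conn.SetReadBuffer(7 << 20); err != nil { // 7 MiB
			log.Printf("SetReadBuffer: %v", err)
		}
		if err := conn.SetWriteBuffer(7 << 20); err != nil {
			log.Printf("SetWriteBuffer: %v", err)
		}
	}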

cc @jwhited who will also be interested in this.

There is no mechanism by which Windows would retransmit a UDP packet; UDP is fire-and-forget. The retransmission is gvisor, within the tsnet app, deciding to re-send a TCP packet.

Because there is clearly some sort of Windows involvement in the result, my guess would be one of the following:

  • Windows is losing small UDP frames containing the ACKs back from the Mac. The pcap is gathered on the Mac, which only knows that it saw them going out and doesn’t know whether Windows received them.
  • Windows is delaying small UDP frames containing the ACKs back from the Mac.
  • Something in goroutine scheduling specific to Windows is resulting in not processing the ACKs in a timely way.

Using a Windows 11 VM running https://github.com/fancybits/tsnetspeedtest, I get about 25 Mbps downloading from tsnet. A time-sequence graph from Wireshark is attached. Zooming way in, we see a lot of selective ACKs and retransmissions.

[Screenshots: time-sequence graphs of the tsnet download]

Using the LAN to the same VM achieves 500 Mbps. A time-sequence graph shows no significant retransmissions.

[Screenshots: time-sequence graphs of the LAN download]

I think this is as far as we’re going to be able to take it for now. We’d need to have a full team of engineers go dig into netstack at this point and then push any fixes upstream, and right now we don’t have staff available to do that. It will be a while before we can do significantly more work on tsnet throughput for Windows.

@tmm1 & @eric - can you re-test with https://github.com/tailscale/tailscale/commit/5e861c38718ffcde3ded6d2922ca464886e41321 or later? This commit improves performance significantly in my local testing of tsnet on Windows.

The results I gathered from before the change align with @DentonGentry’s findings: gVisor appears to be slow to handle ACKs, or possibly mishandles them until one with a SACK option is seen. Without TCP RACK this behavior is less apparent, as seen in fewer spurious retransmissions and higher throughput.

I added a runtime/debug.PrintStack() call in injectInbound(), to be sure of how it was being called.
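The hook itself is a one-liner; a trivial standalone equivalent of what was added:

	package main

	import "runtime/debug"

	func main() {
		// debug.PrintStack writes the calling goroutine's stack to stderr;
		// the same call placed inside (*Impl).injectInbound produced the
		// trace below.
		debug.PrintStack()
	}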

The backtrace is:

2023/11/25 11:11:50 [v2] packet in (from 100.100.165.16:49170): 45 00 00 28 9e f2 00 00 40 06 31 5b 64 64 a5 10 64 63 3c ab c0 12 00 50 39 ad 35 5f 00 00 00 00 50 04 00 00 d5 ee 00 00
goroutine 411 [running]:
runtime/debug.Stack()
        C:/Program Files/Go/src/runtime/debug/stack.go:24 +0x5e
runtime/debug.PrintStack()
        C:/Program Files/Go/src/runtime/debug/stack.go:16 +0x13
tailscale.com/wgengine/netstack.(*Impl).injectInbound(0xc000136840, 0xc0001d2a90, 0xc0001d0630?)
        C:/Users/DGentry/go/pkg/mod/tailscale.com@v1.1.1-0.20231125190428-20fa0b3d59f9/wgengine/netstack/netstack.go:752 +0x848
tailscale.com/net/tstun.(*Wrapper).filterPacketInboundFromWireGuard(0xc000226420, 0xc0001d2a90, 0xc0006842f0?)
        C:/Users/DGentry/go/pkg/mod/tailscale.com@v1.1.1-0.20231125190428-20fa0b3d59f9/net/tstun/wrap.go:1050 +0x785
tailscale.com/net/tstun.(*Wrapper).Write(0xc000226420, {0xc0001198c0, 0x1, 0x1}, 0x10)
        C:/Users/DGentry/go/pkg/mod/tailscale.com@v1.1.1-0.20231125190428-20fa0b3d59f9/net/tstun/wrap.go:1070 +0x234
github.com/tailscale/wireguard-go/device.(*Peer).RoutineSequentialReceiver(0xc00061c300, 0x1)
        C:/Users/DGentry/go/pkg/mod/github.com/tailscale/wireguard-go@v0.0.0-20231121184858-cc193a0b3272/device/receive.go:525 +0x2ff
created by github.com/tailscale/wireguard-go/device.(*Peer).Start in goroutine 409
        C:/Users/DGentry/go/pkg/mod/github.com/tailscale/wireguard-go@v0.0.0-20231121184858-cc193a0b3272/device/peer.go:209 +0x2e5

This calls netstack.go’s injectInbound(): https://github.com/tailscale/tailscale/blob/d5d84f1a68dc9495ed7b91b4655a869268ac3cfd/wgengine/netstack/netstack.go#L710

which calls ns.linkEP.InjectInbound() to send the packet into netstack: https://github.com/tailscale/tailscale/blob/d5d84f1a68dc9495ed7b91b4655a869268ac3cfd/wgengine/netstack/netstack.go#L753
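For readers following the chain, handing raw bytes to netstack through a channel link endpoint looks roughly like this (a sketch; gvisor’s buffer API differs across versions):

	package main

	import (
		"gvisor.dev/gvisor/pkg/buffer"
		"gvisor.dev/gvisor/pkg/tcpip/header"
		"gvisor.dev/gvisor/pkg/tcpip/link/channel"
		"gvisor.dev/gvisor/pkg/tcpip/stack"
	)

	// injectIPv4 mirrors the tail of the call chain above: wrap the raw
	// bytes in a PacketBuffer and inject them as inbound IPv4 traffic.
	func injectIPv4(linkEP *channel.Endpoint, raw []byte) {
		pkt := stack.NewPacketBuffer(stack.PacketBufferOptions{
			Payload: buffer.MakeWithData(raw),
		})
		defer pkt.DecRef()
		linkEP.InjectInbound(header.IPv4ProtocolNumber, pkt)
	}

	func main() {}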

The Windows machine appears to be retransmitting packets which shouldn’t need to be retransmitted.

The test setup is a browser on a Mac connecting over Tailscale to the tsspeedtest app running on Windows. The pcaps are gathered on the Mac, where we have a full Tailscale installed using a tun device, so we can capture the cleartext content inside the WireGuard tunnel. We don’t have a way to get a pcap of the cleartext from inside the tsnet app on Windows.

This is called a time-sequence graph: transmitted packets are shown as vertical lines, and the ACKs coming back as horizontal ones. It can be found in Wireshark under Statistics -> TCP Stream Graphs.

[Screenshot: time-sequence graph with the retransmissions in red]

The red lines are four packets which the Windows machine decided to retransmit for unclear reasons. The Mac had sent back an ACK, but the Windows system retransmitted anyway. The Windows system appears either to be losing ACKs or to be very tardy in processing them.

Packet dissections from the PCAP of the retransmission marked “A” are shown below.

  • Packet 5284 is the first vertical line where the Windows system sent a chunk of data for the first time.
  • Packet 5289 is the ACK from the Mac back to Windows.
  • Packet 5332 is the Windows system retransmitting the packet, which should have already been successfully acknowledged.
  • Packet 5334 is the Mac sending another ACK for the retransmission.
  • Packet 5335 is the Mac re-sending the ACK yet again, presumably just for good measure.
No.     Time           Source                Destination           Protocol Length Info
   5284 28.940318      100.71.236.150        100.100.165.16        TCP      1252   80 → 55535 [ACK] Seq=800897 Ack=393 Win=524288 Len=1200 TSval=3710253163 TSecr=2922268938 [TCP segment of a reassembled PDU]

Frame 5284: 1252 bytes on wire (10016 bits), 1252 bytes captured (10016 bits) on interface utun5, id 0
Raw packet data
Internet Protocol Version 4, Src: 100.71.236.150, Dst: 100.100.165.16
Transmission Control Protocol, Src Port: 80, Dst Port: 55535, Seq: 800897, Ack: 393, Len: 1200

No.     Time           Source                Destination           Protocol Length Info
   5289 28.940389      100.100.165.16        100.71.236.150        TCP      52     55535 → 80 [ACK] Seq=393 Ack=802097 Win=512960 Len=0 TSval=2922268946 TSecr=3710253163

Frame 5289: 52 bytes on wire (416 bits), 52 bytes captured (416 bits) on interface utun5, id 0
Raw packet data
Internet Protocol Version 4, Src: 100.100.165.16, Dst: 100.71.236.150
Transmission Control Protocol, Src Port: 55535, Dst Port: 80, Seq: 393, Ack: 802097, Len: 0

No.     Time           Source                Destination           Protocol Length Info
   5332 28.948753      100.71.236.150        100.100.165.16        TCP      1252   [TCP Spurious Retransmission] 80 → 55535 [ACK] Seq=800897 Ack=393 Win=524288 Len=1200 TSval=3710253171 TSecr=2922268944

Frame 5332: 1252 bytes on wire (10016 bits), 1252 bytes captured (10016 bits) on interface utun5, id 0
Raw packet data
Internet Protocol Version 4, Src: 100.71.236.150, Dst: 100.100.165.16
Transmission Control Protocol, Src Port: 80, Dst Port: 55535, Seq: 800897, Ack: 393, Len: 1200

No.     Time           Source                Destination           Protocol Length Info
   5334 28.948772      100.100.165.16        100.71.236.150        TCP      64     [TCP Dup ACK 5289#2] 55535 → 80 [ACK] Seq=393 Ack=802097 Win=512960 Len=0 TSval=2922268955 TSecr=3710253163 SLE=798497 SRE=800897

Frame 5334: 64 bytes on wire (512 bits), 64 bytes captured (512 bits) on interface utun5, id 0
Raw packet data
Internet Protocol Version 4, Src: 100.100.165.16, Dst: 100.71.236.150
Transmission Control Protocol, Src Port: 55535, Dst Port: 80, Seq: 393, Ack: 802097, Len: 0

No.     Time           Source                Destination           Protocol Length Info
   5335 28.948791      100.100.165.16        100.71.236.150        TCP      64     [TCP Dup ACK 5289#3] 55535 → 80 [ACK] Seq=393 Ack=802097 Win=512960 Len=0 TSval=2922268955 TSecr=3710253171 SLE=800897 SRE=802097

Frame 5335: 64 bytes on wire (512 bits), 64 bytes captured (512 bits) on interface utun5, id 0
Raw packet data
Internet Protocol Version 4, Src: 100.100.165.16, Dst: 100.71.236.150
Transmission Control Protocol, Src Port: 55535, Dst Port: 80, Seq: 393, Ack: 802097, Len: 0

The impact of this is more than just the airtime spent retransmitting: the netstack TCP implementation will treat these retransmissions as congestion and slow down.
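To make the cost concrete: a Reno-style sender roughly halves its congestion window each time it decides a segment was lost, so repeated spurious retransmissions compound quickly. A toy illustration with hypothetical names (not gvisor’s code):

	package main

	import "fmt"

	// onLoss models a NewReno-style reaction to a detected (here, spurious)
	// loss: ssthresh falls to half the window, floored at two segments.
	func onLoss(cwnd, mss int) int {
		ssthresh := cwnd / 2
		if ssthresh < 2*mss {
			ssthresh = 2 * mss
		}
		return ssthresh
	}

	func main() {
		const mss = 1200 // segment size seen in the captures above
		cwnd := 64 * mss
		for i := 1; i <= 3; i++ {
			cwnd = onLoss(cwnd, mss)
			fmt.Printf("after spurious loss %d: cwnd = %d bytes\n", i, cwnd)
		}
	}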