kubernetes: kubectl port-forward doesn't properly recover from interrupted connection errors

What happened: When I open a large number (in the hundreds) of concurrent long-running HTTP requests against a port-forwarded pod running on GKE 1.13.6-gke.0 and then interrupt/cancel them, I observe timeout errors reported by kubectl port-forward pod/[pod_name] 8000:8000. Eventually, the forwarded port becomes permanently unusable until I relaunch the command.

Types of messages that get logged by kubectl port-forward:

  1. Handling connection for 8000
  2. E0528 16:12:42.255465 13724 portforward.go:303] error copying from remote stream to local connection: readfrom tcp4 127.0.0.1:8000->127.0.0.1:62974: write tcp4 127.0.0.1:8000->127.0.0.1:62974: wsasend: An established connection was aborted by the software in your host machine.
  3. E0528 16:13:12.495226 13724 portforward.go:293] error creating forwarding stream for port 8000 -> 8000: Timeout occured
  4. E0528 16:13:53.891935 13724 portforward.go:271] error creating error stream for port 8000 -> 8000: Timeout occured

After the port-forward becomes permanently unusable, I still see incoming requests logged as:

Handling connection for 8000

which then all fail about 30 seconds later with the following error, without the request ever being sent to the pod:

E0528 16:38:06.245530    6668 portforward.go:271] error creating error stream for port 8000 -> 8000: Timeout occured

What you expected to happen:

  1. kubectl port-forward should never end up in a permanently failed state that requires relaunching it
  2. All concurrent long-running HTTP requests should succeed (up to some reasonable number of them, of course)

How to reproduce it (as minimally and precisely as possible):

  1. Create an nginx pod serving a large 1GB static file:
    kubectl run -ti --rm --restart=Never --image=nginx nginx -- bash "-c" "dd if=/dev/zero of=/usr/share/nginx/html/large_file.bin count=1024 bs=1048576 && nginx -g 'daemon off;'"
    
  2. Port-forward it: kubectl port-forward nginx 1234:80
  3. Run Apache Bench with 100 total requests, 100 concurrent requests, and a 1 second timeout, then interrupt it with Control+C (for some reason the timeout doesn’t seem to be well enforced):
    ab -n 100 -c 100 -s 1 http://127.0.0.1:1234/large_file.bin
    
    Note that in my initial use-case, I managed to get this issue simply by quickly refreshing the URL in Google Chrome without any benchmarking tool.
  4. If needed, re-run the same Apache Bench command until the port-forward stops serving incoming requests
  5. Observe that all requests to the nginx server, even for http://127.0.0.1:1234/index.html, now fail. Additionally, nginx doesn’t log any of those requests.
  6. Relaunch the kubectl port-forward nginx 1234:80 command, and the URL becomes available again, showing that the pod was still healthy. (A consolidated script sketch of these steps follows.)
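
For convenience, the steps above can be scripted roughly as follows (a sketch only, not verified end-to-end; it assumes kubectl, ab, and curl are installed and that you can create pods in the current namespace; the interactive -ti --rm flags from step 1 are dropped so the pod stays in the background):

    #!/usr/bin/env bash
    # Rough, non-interactive version of the repro steps above.
    set -euo pipefail

    # Step 1: create an nginx pod serving a 1 GiB static file.
    kubectl run --restart=Never --image=nginx nginx -- bash -c \
      "dd if=/dev/zero of=/usr/share/nginx/html/large_file.bin count=1024 bs=1048576 && nginx -g 'daemon off;'"
    kubectl wait --for=condition=Ready pod/nginx --timeout=120s
    # Give dd a moment to finish so nginx is actually listening.
    sleep 15

    # Step 2: port-forward it in the background.
    kubectl port-forward nginx 1234:80 &
    PF_PID=$!
    sleep 2

    # Step 3: 100 concurrent requests with a 1 second timeout; when run
    # interactively, interrupt ab with Control+C after a few seconds.
    ab -n 100 -c 100 -s 1 http://127.0.0.1:1234/large_file.bin || true

    # Step 5: check whether a small file still goes through the same forward.
    curl -m 10 -sS -o /dev/null -w '%{http_code}\n' http://127.0.0.1:1234/index.html || true

    # Clean up the forward (the pod can be removed with: kubectl delete pod nginx).
    kill "$PF_PID" || true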

Anything else we need to know?:

Possibly related issues & PRs:

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.8-dispatcher", GitCommit:"1215389331387f57594b42c5dd024a2fe27334f8", GitTreeState:"clean", BuildDate:"2019-05-13T18:28:02Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.6-gke.0", GitCommit:"14c9138d6fb5b57473e270fe8a2973300fbd6fd6", GitTreeState:"clean", BuildDate:"2019-05-08T16:22:55Z", GoVersion:"go1.11.5b4", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: GKE 1.13.6-gke.0
  • OS (e.g: cat /etc/os-release): Windows
  • Kernel (e.g. uname -a): Windows
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

Most upvoted comments

Same error when port-forwarding to an AWS EKS pod:

E0806 17:04:44.805492 78962 portforward.go:385] error copying from local connection to remote stream: read tcp6 [::1]:3000->[::1]:61216: read: connection reset by peer

kubectl version:

Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.2", GitCommit:"f6278300bebbb750328ac16ee6dd3aa7d3549568", GitTreeState:"clean", BuildDate:"2019-08-05T16:54:35Z", GoVersion:"go1.12.7", Compiler:"gc", Platform:"darwin/amd64"}

Does anybody have a solid workaround for this?

I’m giving up on this

Not a true solution, but it helps in the meantime:

while true; do <port-forward command>; done
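
For example (a sketch; svc/my-svc 8000:8000 is just a placeholder, substitute your own resource and ports; the sleep avoids spinning if kubectl exits immediately):

    while true; do
        # Relaunch the forward whenever kubectl exits, for whatever reason.
        kubectl port-forward svc/my-svc 8000:8000
        # Short pause so a fast failure doesn't turn into a tight loop.
        sleep 1
    done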

Does anybody have a solid workaround for this?

I tried @moewill’s loop solution, but most of the time, kubectl wouldn’t exit when the error occurred, leaving it stranded in a failed state until I manually hit CTRL+C (and the loop relaunched it).

To address the problem, I cloned the kubectl source and modified it to exit whenever an error occurs. This appears to be a pretty good workaround for my use case since when the loop relaunches it, everything starts working again. Maybe this workaround would help folks out until a proper fix is merged. I’ll try to put together a pull request.

While this is no solution for the port-forward recovering properly, I did find it solved my use-case.

When trying to tunnel to a mysql box we couldn’t keep a connection alive for more than ~60 seconds, completely ruining the experience for anyone running long-running queries.

By creating a new connection regularly we seemed to be able to keep the port-forward tunnel open.

  • Start your port-forward command
  • Write something that pings the pod on the target port every 20 seconds; for our use case we just have mysqladmin ping -h 127.0.0.1 running in a loop (see the sketch after this list)
  • Run your intended long-running command alongside this ping script.
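
A minimal sketch of that keep-alive loop (assuming the forward maps MySQL to 127.0.0.1:3306 and mysqladmin is installed locally; adjust the host, port, and interval to your setup):

    while true; do
        # A trivial round-trip every 20 seconds keeps traffic flowing
        # through the tunnel so the forwarded connection never sits idle.
        mysqladmin ping -h 127.0.0.1 -P 3306 >/dev/null 2>&1
        sleep 20
    done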

Here’s my case (just one of many): all of the usual networking options involve exposing the service to the internet.

  • From the above comments, many people use a basic version of grafana, probably not configured for internet exposure (or sometimes company policy prevents it anyway). I used this example because grafana is exceptionally easy to break through port-forward (it breaks within minutes, requiring a restart of the forwarding).
  • We use it mostly to give developers, who already have access to their namespaces, easy access to backend services that also should not be exposed to the internet. They already have their service accounts and RBAC set up. It’s useful to be able to push or pull a database dump, which is much harder when your connection tunnel is unstable.

On the tooling side, many tools (minikube, for example) seem to rely on port-forward too.

On the dev side (using minikube or other dev-cluster tools that don’t provide a stable IP for your NodePorts/LoadBalancers), it’s very convenient to know that you can use localhost with a known port in your scripts instead of looking up the IP every time (and maybe changing your hosts file).

port-forward is the only tunneling approach (that I know of) that doesn’t need elevation (or root on Linux).

Since it looks like this isn’t likely to get fixed any time soon:

Here’s a rather ugly hack/workaround Bash script to automatically restart the port forwarding when it crashes.

Change the value variable to the appropriate port forwarding command.

It will run the port-forward, then parse its stderr output for errors; if an error line contains “portforward.go” (which these kinds of errors do), it restarts the forward; if the port-forward exits for any other reason (such as the process being killed, or a legitimate error), it exits instead.

This could probably be written nicer, but it works for my purposes.

# Change this to your own port-forward command.
value='kubectl port-forward svc/grafana 3000:80'
while true; do
      echo "--> $value"
      # Discard stdout and pipe only stderr into the reader below.
      $value 2>&1 >/dev/null |
      while IFS= read -r line
      do
            echo "### $line"
            # kubectl's forwarding errors all mention portforward.go;
            # exit 1 to restart on those, exit 0 to stop on anything else.
            if [[ "$line" == *"portforward.go"* ]]; then
                  echo "Restarting port forwarding $value"
                  exit 1
            else
                  exit 0
            fi
      done
      # The pipeline's status is the inner loop's exit code:
      # 0 means stop the outer loop, 1 means relaunch the forward.
      if [ $? -eq 0 ]; then
            break;
      fi
done

Thanks @atinsinghal97. This does seem to help. For my particular issue I believe the fault is with Grafana (older builds have no issues), but this may still be useful for the testing I’m doing, sending OpenTelemetry data forward from my local machine.

this is not fixed yet, come on, random problems are the worst