kubernetes: kubectl port-forward doesn't properly recover from interrupted connection errors
What happened:
When I open a large number (in the hundreds) of concurrent long-running HTTP requests against a port-forwarded pod running on GKE 1.13.6-gke.0, and then interrupt/cancel them, I observe timeout errors being reported by kubectl port-forward pod/[pod_name] 8000:8000. Eventually, the forwarded port becomes permanently unusable until I relaunch the command.
Types of messages that get logged by kubectl port-forward:
Handling connection for 8000
E0528 16:12:42.255465 13724 portforward.go:303] error copying from remote stream to local connection: readfrom tcp4 127.0.0.1:8000->127.0.0.1:62974: write tcp4 127.0.0.1:8000->127.0.0.1:62974: wsasend: An established connection was aborted by the software in your host machine.
E0528 16:13:12.495226 13724 portforward.go:293] error creating forwarding stream for port 8000 -> 8000: Timeout occured
E0528 16:13:53.891935 13724 portforward.go:271] error creating error stream for port 8000 -> 8000: Timeout occured
After the port-forward becomes permanently unusable, I still see incoming requests logged as:
Handling connection for 8000
which then all fail about 30 seconds later with the following error, without the request being sent to the pod:
E0528 16:38:06.245530 6668 portforward.go:271] error creating error stream for port 8000 -> 8000: Timeout occured
What you expected to happen:
- Never have to relaunch kubectl port-forward because it is in a permanently failed state
- All concurrent long-running HTTP requests to succeed (up to some reasonable number of them, of course)
How to reproduce it (as minimally and precisely as possible):
- Create an nginx pod serving a large 1GB static file:
kubectl run -ti --rm --restart=Never --image=nginx nginx -- bash "-c" "dd if=/dev/zero of=/usr/share/nginx/html/large_file.bin count=1024 bs=1048576 && nginx -g 'daemon off;'"
- Port-forward it:
kubectl port-forward nginx 1234:80
- Run Apache Bench with 100 total requests, 100 concurrent connections and a 1-second timeout, and interrupt it using Control+C (for some reason the timeout doesn't seem to be well enforced):
ab -n 100 -c 100 -s 1 http://127.0.0.1:1234/large_file.bin
Note that in my initial use-case, I managed to trigger this issue simply by quickly refreshing the URL in Google Chrome, without any benchmarking tool.
- If needed, re-run the same Apache Bench command until the port-forwarding fails to serve all incoming requests
- Observe that now all requests to the nginx server, even to http://127.0.0.1:1234/index.html, fail. Additionally, nginx doesn't log those requests.
- Relaunch the kubectl port-forward nginx 1234:80 command, and the URL becomes available again, showing that the pod was still healthy.
Anything else we need to know?:
Possibly related issues & PRs:
- https://github.com/openshift/origin/issues/4287 - reports a similar bug
- https://github.com/kubernetes/kubernetes/issues/13673
- https://github.com/kubernetes/kubernetes/pull/12283
Environment:
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.8-dispatcher", GitCommit:"1215389331387f57594b42c5dd024a2fe27334f8", GitTreeState:"clean", BuildDate:"2019-05-13T18:28:02Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.6-gke.0", GitCommit:"14c9138d6fb5b57473e270fe8a2973300fbd6fd6", GitTreeState:"clean", BuildDate:"2019-05-08T16:22:55Z", GoVersion:"go1.11.5b4", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: GKE 1.13.6-gke.0
- OS (e.g. cat /etc/os-release): Windows
- Kernel (e.g. uname -a): Windows
- Install tools:
- Network plugin and version (if this is a network-related bug):
- Others:
About this issue
- State: open
- Created 5 years ago
- Reactions: 49
- Comments: 51 (4 by maintainers)
Same error when port-forwarding to AWS EKS pod:
E0806 17:04:44.805492 78962 portforward.go:385] error copying from local connection to remote stream: read tcp6 [::1]:3000->[::1]:61216: read: connection reset by peer
kubectl version:
Does anybody have a solid workaround for this?
I’m giving up on this
Not a true solution, but helps in the meantime
while true; do <port-forward command>; done
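For reference, a slightly more concrete form of that loop, borrowing the nginx port-forward from the reproduction steps above (the one-second sleep is just an assumption to avoid busy-looping when the connection can't be established at all):
```sh
# Restart the port-forward whenever it exits.
while true; do
  kubectl port-forward nginx 1234:80
  echo "port-forward exited, restarting in 1s..." >&2
  sleep 1
done
```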
I tried @moewill’s loop solution, but most of the time, kubectl wouldn’t exit when the error occurred, leaving it stranded in a failed state until I manually hit CTRL+C (and the loop relaunched it).
To address the problem, I cloned the kubectl source and modified it to exit whenever an error occurs. This appears to be a pretty good workaround for my use case since when the loop relaunches it, everything starts working again. Maybe this workaround would help folks out until a proper fix is merged. I’ll try to put together a pull request.
duplicate of https://github.com/kubernetes/kubectl/issues/686
While this is no solution for the port-forward recovering properly, I did find it solved my use-case.
When trying to tunnel to a mysql box, we couldn't keep a connection alive for more than ~60 seconds, completely ruining the experience for anyone running long-running queries.
By regularly creating a new connection, we seemed to be able to keep the port-forward tunnel open: mysqladmin ping -h 127.0.0.1 running in a loop.
Here's (only) my case: all of the usual networking options normally involve exposing the service to the internet.
On the tools side, many tools (minikube for example) seem to rely on port-forward too.
On the dev side (using minikube or other dev-cluster tools that don't provide a stable IP for your NodePorts/LoadBalancers), it's just very convenient to know that you can use localhost with a known port in your scripts instead of looking up the IP every time (and maybe changing your hosts file).
port-forward is the only tunneling approach (that I know of) that doesn't need elevation (or root on Linux).
Since it looks like this isn’t likely to get fixed any time soon:
Here’s a rather ugly hack/workaround Bash script to automatically restart the port forwarding when it crashes.
Change the value variable to the appropriate port forwarding command.
It runs the port forwarding, then parses its stderr output for errors; if the errors contain "portforward.go" (which these kinds of errors do), it restarts; if the port forwarding exits for any other reason (such as the process being killed, or a legitimate error), it exits instead.
This could probably be written nicer, but it works for my purposes.
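A rough sketch of that idea follows; this is not the original script, and the kubectl port-forward nginx 1234:80 value is only a placeholder borrowed from the reproduction steps, so substitute your own command:
```sh
#!/usr/bin/env bash
# Sketch only: run port-forward, watch its stderr for portforward.go errors,
# and restart it when one appears; stop the loop for any other exit reason.
PF_CMD="kubectl port-forward nginx 1234:80"   # placeholder command

while true; do
  err_log=$(mktemp)
  # Start port-forward in the background, mirroring its stderr into a log file.
  $PF_CMD 2> >(tee "$err_log" >&2) &
  pf_pid=$!

  # Poll the log; kill the port-forward as soon as a portforward.go error shows up.
  while kill -0 "$pf_pid" 2>/dev/null; do
    if grep -q "portforward.go" "$err_log"; then
      kill "$pf_pid"
      break
    fi
    sleep 2
  done
  wait "$pf_pid" 2>/dev/null

  if grep -q "portforward.go" "$err_log"; then
    echo "portforward.go error detected, restarting port-forward..." >&2
    rm -f "$err_log"
    sleep 1
  else
    rm -f "$err_log"
    break   # exited for some other reason (e.g. killed manually), so stop
  fi
done
```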
Thanks @atinsinghal97. This does seem to help. For my particular issue I believe the fault is with Grafana (older builds have no issues); however, this may be useful for the testing I'm doing, sending OpenTelemetry data forward from my local machine.
this is not fixed yet, come on, random problems are the worst