kubernetes: kubectl port-forward broken pipe
What happened:
We have a pod running in our k8s cluster that I connect to via kubectl port-forward. I am able to connect to this pod (using the following command), but then start getting broken pipe error messages after maintaining a connection for what is typically 30-60s.
kubectl port-forward --namespace monitoring deployment/cost-monitor 9090
There seems to be a correlation between error rate and the amount of data being transferred. I see the following error message initially:
E0225 15:20:06.212139 26392 portforward.go:363] error copying from remote stream to local connection: readfrom tcp6 [::1]:9090->[::1]:57794: write tcp6 [::1]:9090->[::1]:57794: write: broken pipe
These errors are oftentimes followed by timeout messages, but not necessarily immediately:
E0225 15:22:30.454203 26392 portforward.go:353] error creating forwarding stream for port 9090 -> 9090: Timeout occured
What you expected to happen: No error messages or major degradation in transfer rate.
How to reproduce it (as minimally and precisely as possible): Connect via port-forward and transfer ~5 MB over several minutes.
What else?: We have multiple HTTP requests being made at any given time on this port-forward connection.
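For context, a rough sketch of the kind of load described above, run against the forward started with the command earlier (the request path and count are illustrative, not from the original report):

```sh
# Issue overlapping HTTP requests through the forwarded port until a few MB
# have been transferred; in our setup the errors typically appear within 30-60s.
for i in $(seq 1 100); do
  curl -s -o /dev/null http://localhost:9090/ &
done
wait
```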
Environment: Experiencing on both AWS kops and GKE
// kops Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-04T04:48:03Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"darwin/amd64"} Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.11", GitCommit:"637c7e288581ee40ab4ca210618a89a555b6e7e9", GitTreeState:"clean", BuildDate:"2018-11-26T14:25:46Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
// GKE Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-04T04:48:03Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"darwin/amd64"} Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.11-gke.1", GitCommit:"5c4fddf874319c9825581cc9ab1d0f0cf51e1dc9", GitTreeState:"clean", BuildDate:"2018-11-30T16:18:58Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}
About this issue
- State: open
- Created 5 years ago
- Reactions: 162
- Comments: 102 (25 by maintainers)
I had a similar problem using kubectl port-forward and I resolved it with ulimit -n 65536 on Mac OS. I ran ulimit -n 65536 as-is; you might need sudo on your system. This increases the file descriptor limit of the local shell where you're running kubectl port-forward.

My hypothesis is that kubectl port-forward doesn't clean up its sockets properly, so the local shell runs into the file descriptor limit after some time under high load (or maybe after terminating a few times). This seemed to stop port-forward from breaking all the time when I was running automated tests against a K8s service I was trying to debug.
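For reference, a sketch of that workaround in one shell session (using the deployment from the original report; the lsof check is just a way to watch descriptor usage, not part of the original comment):

```sh
# Check the shell's current open-file limit, raise it, then start the forward
# from the same shell. sudo may be required on some systems.
ulimit -n              # show the current limit
ulimit -n 65536        # raise it for this shell session only
kubectl port-forward --namespace monitoring deployment/cost-monitor 9090

# Optional: watch how many file descriptors the kubectl process is holding.
lsof -p "$(pgrep -f 'kubectl port-forward' | head -n1)" | wc -l
```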
Just noticed this myself with a fresh single node installation. Port forwarding didn't work at first. When I let kubectl decide the host port, port-forwarding worked! Then forwarding to the same port on the host worked!
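For anyone trying that approach: leaving the local port empty makes kubectl pick a free one (the service name here is illustrative, borrowed from a later comment):

```sh
# With an empty local port (":8080"), kubectl chooses a random free local port
# and prints it on startup, e.g. "Forwarding from 127.0.0.1:54321 -> 8080".
kubectl port-forward svc/foo :8080
```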
What's the fix for this?
I'm trying to locate and fix this issue [WIP].
Thanks for https://github.com/kargakis/k8s-74551-reproducer
Environment:
[Please correct me if I'm wrong]
Here are some pictures for troubleshooting:
That is kubectl.
https://github.com/containerd/containerd/blob/290a800e83d5460207cf198516359bfd2b5038d6/pkg/cri/streaming/portforward/httpstream.go#L258

So I made some fixes to containerd: https://github.com/sxllwx/containerd/commit/28755fff7d67b64576054e4fbc4845e116d92b63

I'm running the failing example fine with containerd using this patch.
I had success with @anthcor's solution: let kubectl decide the local port.

I ALSO had success by specifying the local address as 127.0.0.1 (I don't need IPv6). The WEIRD thing is that after doing this I can go back to the original form, kubectl port-forward svc/foo 1234:8080, and it works again. This smells like a socket reuse issue. This is docker-desktop on a Mac.
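A sketch of that second workaround using kubectl port-forward's --address flag (service and ports taken from the comment above):

```sh
# Bind the local end explicitly to IPv4 loopback instead of the default
# "localhost", which can resolve to both 127.0.0.1 and ::1.
kubectl port-forward --address 127.0.0.1 svc/foo 1234:8080
```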
/assign
I have a similar problem; it occurs in an Azure k8s cluster (client: v1.19.3, server: v1.17.3) and in a local kind cluster (client: v1.19.3, server: v1.19.1). It looks very similar to @ysimonson's suggestion here, and (I guess) it's related to HTTP streams. The problem occurs when you stop downloading a webpage/file while the download is in progress. Then broken pipe logs appear, and after a while Timeout occured. After that, everything stops and the connection is unusable. I don't know exactly how k8s/kubelet works in detail, but these logs from containerd always appear when the connection is broken. When the problem occurs in "production", these logs have hundreds of connections inside [].

Long story short: I prepared a reproduction of this bug with GitHub Actions, and the problem occurs in this workflow: https://github.com/velmafia/k8s_issue-74551/actions/runs/346450615 The kind logs are stored in the artifacts; feel free to download/fork and retest with your version of k8s.

We're also seeing similar issues at Twitter in our usages of kubectl port-forward.

I have created a simple client+server reproducer at https://github.com/kargakis/k8s-74551-reproducer
This issue is unfortunate; kubectl port-forward makes it very easy for our less kube-experienced developers to leverage our production deployment for development.

Hi, I'm hitting this on my cluster too.
Here's what we're currently running: Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-04T04:48:03Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"darwin/amd64"} Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.6", GitCommit:"b1d75deca493a24a2f87eb1efde1a569e52fc8d9", GitTreeState:"clean", BuildDate:"2018-12-16T04:30:10Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Any additional repro info that I can provide?
happens to me when port-forwarding to minio and trying to download large files, getting:
E0706 18:38:06.048525 1 portforward.go:340] error creating error stream for port 9000 -> 9000: Timeout occured
Wow! I wasn’t expecting that culprit! great find!
One of the reasons (there can be more) is that the connection can sit idle and some intermediate device with low timers closes the TCP session, so it eventually times out on one or both endpoints of the connection. SSH can configure periodic keepalives to keep traffic flowing and renew the timeouts, or maybe you are using the session … This is why it is important to know whether you are hitting the issue with a continuous TCP stream.
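One illustrative way to test that (the path is hypothetical; any cheap endpoint behind the forward works) is to keep a trickle of traffic flowing through the forwarded port and compare against an idle run:

```sh
# Poll the forwarded port every 10 seconds so the TCP session never sits idle,
# then compare against a run with no traffic at all.
while true; do
  curl -s -o /dev/null http://localhost:9090/
  sleep 10
done
```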
This issue is waiting for a contributor to dig in and diagnose. My current guess is recorded above in the comment here.
I can also reliably reproduce it by trying to fetch a 6 MB file over a simple HTTP server.
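A minimal sketch along those lines (pod name, image, and paths are illustrative, not from the comment):

```sh
# Serve a ~6 MB blob from a throwaway pod and pull it through a port-forward.
kubectl run bigfile --image=python:3 --command -- sh -c \
  'head -c 6M /dev/urandom > /tmp/blob && cd /tmp && python -m http.server 8000'
kubectl wait --for=condition=Ready pod/bigfile
kubectl port-forward pod/bigfile 8000:8000 &
sleep 2
curl -o /dev/null http://localhost:8000/blob   # watch kubectl's output for broken pipe / timeout errors
```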
I also get this, especially when using kubectl port-forward for redis or postgres or haproxy.
I’m seeing this issue as well. I think kubelet is not properly closing the error channel associated with port forwarded connections, because I’m seeing the port forwarder get stuck here.
You don't even need to send a lot of data; just break the reading end of the connection from the client side while the port forwarder is trying to write. Assuming you're port-forwarding an HTTP server available on port 30600 locally, this will reliably reproduce the issue on k8s for Docker for Mac (server version v1.10.11):
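The original script was not captured in this extraction; a hedged sketch of the same idea (repeatedly abort the client-side read mid-transfer) could look like:

```sh
# Start a download through the forwarded port and abort it before it finishes,
# so the local reading end closes while kubectl is still copying data into it.
for i in $(seq 1 50); do
  curl -s --max-time 0.2 -o /dev/null http://localhost:30600/ || true
done
```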
I can reliably reproduce the issue on a newer version of k8s + Ubuntu + minikube as well, but it is more resilient, and the above script won't trigger the problem. I'm not sure yet how to make a minimal reproducible test for that target.
Looks like the same problem. You can try the containerd branch at https://github.com/sxllwx/containerd/tree/fix/k8s-issue-74551; I believe it resolves this issue.
I’m also pushing this PR to be merged.
Looks like it’s the last commit in this branch: https://github.com/sxllwx/containerd/commits/fix/k8s-issue-74551
Wow, it really works! Thanks!
Upon adding the following to our haproxy front-end and the following to the backend, this issue has pretty much been eliminated:
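The exact directives were not captured in this extraction; a hedged sketch of the kind of haproxy timeout tuning that helps long-lived tunneled connections (section names and values are illustrative) might look like:

```
# haproxy.cfg sketch - illustrative values, not the commenter's exact settings
frontend kubernetes-api
    bind *:6443
    timeout client  4h      # keep long-lived client connections open

backend kubernetes-api
    timeout server  4h      # match on the server side
    timeout tunnel  4h      # applies once the connection is upgraded (SPDY/WebSocket)
```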
Upon making that change, absolutely zero timeouts for an idle port-forward for quite some time. There must be similar options if you have a different load balancer in front of your Kubernetes API; perhaps that is something people who are affected by this can try.

Is there any fix without upgrading containerd? I do not have access.
I've had some success in getting better reliability by reducing the MTU of the network interface.
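For reference, a hypothetical example of doing that on Linux (the interface name and MTU value are illustrative; the comment did not specify them):

```sh
# Show the current MTU, then lower it on the interface that carries cluster traffic.
ip link show eth0
sudo ip link set dev eth0 mtu 1400
```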
An interesting side note: I was getting this same error using curl commands against a gRPC service being port-forwarded to… I tried various things mentioned in this issue but none of them worked. I then tried a Go program that hit the same port-forwarded address and the error did not occur. No idea why, but I thought it was worth mentioning for others seeing this error.
https://github.com/kubernetes/kubernetes/issues/74551#issuecomment-910520361 is the best short-term fix imo. have run into this many times testing many services port forwarding to my local network
I did not have success with @anthcor 's workaround. Hope this gets fixed.
/assign
@aojea, the original issue and I both have the error:
error copying from remote stream to local connection ... write: broken pipe
Full errors:
I think the other commented errors should be addressed in a separate issue so we could nail down why this specific error happens and fix it.
I get the error when I port-forward directly to an haproxy, redis, or postgres pod and try to read a lot of data through it. There is no load balancer in front of the apiserver. We're using AWS EKS with version v1.13.12-eks-eb1860 and client version v1.16.3.
Can you say more about that? That's suspicious; if the problem is binary data, maybe there's some specific byte sequence that is tickling a bug in the proxy.
Also got the same kind of problem as @rihardsk. When looking at the kube-registry logs from the container, at least the following error logs were seen.

Edit: After increasing the memory limit (from 100Mi to 200Mi) of the registry container, I didn't see the problem anymore.
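If anyone wants to try the same mitigation, a hedged sketch using kubectl (the deployment name and namespace are illustrative; the actual registry object may differ):

```sh
# Bump the registry container's memory limit from 100Mi to 200Mi and let the pod restart.
kubectl -n kube-system set resources deployment/kube-registry --limits=memory=200Mi
```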