kubernetes: kubectl cp fails on large files

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: Copying either a large file or a large directory from a container via kubectl cp fails with error: unexpected EOF and the transfer does not complete. In my case, the file is 1.7G.

I executed the following command:

kubectl cp infra-cassandra-global-0:cassandra.tar.gz infra-cassandra-global-0-cassandra.tar.gz

The command starts executing; however, after 10-14 seconds the terminal prints the error below and no file is copied.

error: unexpected EOF

What you expected to happen:

The large file to be downloaded from the container.

How to reproduce it (as minimally and precisely as possible):

Add a large file >= 1.7G to any location in the pod. I was able to reproduce this with the file on a PV, or locally on the image file system.

Execute kubectl cp to download the large file. It will fail.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.8.6, client v1.8.5
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): k8s-1.8-debian-jessie-amd64-hvm-ebs-2017-12-02 (ami-06a57e7e)
  • Kernel (e.g. uname -a):
  • Install tools: Kops
  • Others:

About this issue

  • Original URL
  • State: open
  • Created 6 years ago
  • Reactions: 124
  • Comments: 203 (71 by maintainers)

Most upvoted comments

Why not:

  1. put the large file on an FTP server
  2. use sftp to get the file from inside the container

For “file transfer” use cases, I have added to kubectl the (unfortunately named) --retries flag that will recover from interruptions and continue the transfer where it failed.

@matthyx This works wonders! It appears --retries does not completely retry the file copy but will actually resume the file transfer when the EOF error occurs. Setting --retries very high should work for most cases (and worked for me!).

Before:

$ kubectl cp pod:some.db.tar.gz local.db.tar.gz
tar: Removing leading `/' from member names
Dropping out copy after 0 retries
error: unexpected EOF

The file size on the local machine was not the same as in the container, indicating the transfer was dropped.

After:

$ kubectl cp pod:some.db.tar.gz local.db.tar.gz --retries 999
tar: Removing leading `/' from member names
Resuming copy at 23588864 bytes, retry 1/999
tar: Removing leading `/' from member names
Resuming copy at 23920640 bytes, retry 2/999
tar: Removing leading `/' from member names

And ls -l confirms the file size is the same.

Thank you!

Crazy that after 5 years this is still an issue…

This problem could be solved if kubectl cp implemented resumable downloads to work around temporary network errors.

I tried to copy a single large file using the kubectl (version 1.21) cp command. It failed with an EOF error. However, when I tried to cp the entire folder containing the large file, it worked successfully. All the files, including the large file, were copied without any error.

kubectl cp <pod_id>:/home/test_folder test_folder_localhost

I ended up splitting the file into 500 MB chunks, copied each chunk over one at a time, and it worked fine.

Split the file into 500 MB chunks:

split ./largeFile.bin -b 500m part.

Copy all of the parts:

kubectl cp <pod>:<path_to_part.aa> part.aa

Then reassemble with cat:

cat part* > largeFile.bin

I suggest you use a checksum to validate the file's integrity once you are done.
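
A minimal sketch of automating that chunked transfer, assuming split and sha256sum are available both locally and in the container image, and using illustrative pod and path names:

#!/usr/bin/env bash
set -euo pipefail

POD=my-pod                   # hypothetical pod name
SRC=/data/largeFile.bin      # path of the large file inside the container
DEST=./largeFile.bin         # local output path

# Split the file inside the container into 500 MB chunks (part.aa, part.ab, ...).
kubectl exec "$POD" -- split -b 500M "$SRC" /tmp/part.

# Copy each chunk individually; small chunks are far less likely to hit the EOF.
for chunk in $(kubectl exec "$POD" -- sh -c 'ls /tmp/part.*'); do
  kubectl cp "$POD:$chunk" "./$(basename "$chunk")"
done

# Reassemble locally and compare checksums with the original in the container.
cat part.* > "$DEST"
remote_sum=$(kubectl exec "$POD" -- sha256sum "$SRC" | awk '{print $1}')
local_sum=$(sha256sum "$DEST" | awk '{print $1}')
[ "$remote_sum" = "$local_sum" ] && echo "checksum OK" || echo "checksum MISMATCH"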

I’m speculating here, but the point is that these commands are not designed for intensive, highly reliable workloads. I see them more as debugging tools that are useful for developers; for things like this, people should use other, more reliable mechanisms.

To work around the issue, I ended up running “cat” on the file and redirecting the output to the location I wanted to copy the file to. Ex: kubectl exec -i [pod name] -c [container name] -- cat [path to file] > [output file]

I found the issue here in our environment. We had blocked ICMP packets in the SG attached to the ENI of our API ELB (CLB these days). This meant that requests to fragment large packets were not getting back to the ELB. Because of that, the packet was never re-sent by the ELB, which meant it was lost. This breaks the TCP session and the connection resets.

In short, make sure that ICMP is allowed between your Load Balancer and your hosts, and your MTU settings are correctly calibrated.
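
As a quick check for this class of problem, a non-fragmenting ping at full MTU size can be sent along the suspect path (Linux iputils ping assumed; the target address is illustrative):

# Send a packet that must not be fragmented, sized for a 1500-byte MTU
# (1472 bytes of ICMP payload + 28 bytes of headers = 1500 bytes on the wire).
ping -M do -s 1472 <load-balancer-or-node-address>

# If this hangs or errors while a smaller size (e.g. -s 1400) succeeds,
# packets near the MTU are being dropped somewhere on the path, which
# matches the blocked-ICMP / MTU problem described above.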

Please also tell me what happens when you use the --retries=10 option of kubectl cp… does it always resume at the same number of bytes?

Hello, we understand that you’re frustrated but we don’t speak to each other like that here.

There are 1.7k issues in this repository and not as many maintainers as you think.

We can’t help you unless the issue is nailed down enough that we can reproduce it.

on behalf of the CoCC

--retries works like a charm. Thanks @matthyx

@davesuketu215 here it is: https://github.com/kubernetes/kubernetes/pull/104792 This only affects the cli client and doesn’t require any node or api-server change… which means as soon as you can download an official binary build (not sure if we have nightly ones) it will work in your environment if you use the new kubectl.

I also have the EOF problem (intermittently) running v1.21 on MacOS. I did not try the downgrade suggested above, but I did learn that kubectl cp is really just a thin wrapper on kubectl exec plus tar. I rolled my own, getting rid of tar and adding compression, and it works much better:

kubectl exec -n $ns $pod -c $container -- gzip -c trace.log | gzip -cd > trace.log

or don't even bother to decompress at the end and do:

kubectl exec -n $ns $pod -c $container -- gzip -c trace.log > trace.log.gz

I just did a 54 GB file in about 12 minutes with no errors.

/triage accepted /assign @matthyx

Let’s try to reproduce

I just ran into the same issue. Interestingly, a larger file worked, while a smaller file repeatedly failed. I don't think it has anything to do with the file size, but rather with the fact that the command/protocol involved cannot properly handle binary data. What worked for me is to base64 encode the data on the fly:

kubectl exec -i -c perf-sidecar deploy/mqtt-endpoint -- base64 /out/perf.data.tar.bz2 | base64 -d > perf.data.tar.bz2
  • -c perf-sidecar to target the container in the pod (you can leave this out if you only have one)
  • deploy/mqtt-endpoint for a pod of the deployment (you can also directly use the pod name)

@matthyx --retries=10 worked for me as it was resuming where it failed.

Thanks for the report… I’m super interested in transfers that fail consistently and retry at the exact same place, which would mean we have a special kill sequence in the flow (which I doubt until I see one).

Looks like this is still an issue with 1.20

I was also facing this same issue. My file was 10.3 GB. My workaround was compressing the file using xz (down to about 13% of the original size), then splitting it into 10 MB chunks. I then wrote a script on my end to fetch the individual chunks and joined them back together using cat. Below are the commands that I used:

  1. xz bigfile
  2. split bigfile.xz -db 10M xzchunks --verbose (this will generate numbered 10 MB chunks starting from xzchunks00…)
  3. Then write a script from your end to fetch those chunks (see the sketch below)
  4. Then join them back using cat xzchunks* > bigfile.xz
  5. Finally unxz bigfile.xz and voilà!
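
A minimal sketch of the fetch script in step 3, assuming the chunks were created with two-digit numeric suffixes (split -d) and using illustrative pod and directory names:

#!/usr/bin/env bash
set -euo pipefail

POD=my-pod           # hypothetical pod name
REMOTE_DIR=/data     # directory inside the container that holds the chunks

# Fetch xzchunks00, xzchunks01, ... until the next chunk no longer exists.
i=0
while true; do
  chunk=$(printf 'xzchunks%02d' "$i")
  kubectl exec "$POD" -- test -f "$REMOTE_DIR/$chunk" || break
  kubectl cp "$POD:$REMOTE_DIR/$chunk" "./$chunk"
  i=$((i + 1))
done

Steps 4 and 5 then reassemble and decompress the result.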

My employer is paying a LOT of money to $company for k8s.

I’d suggest that you please raise a support ticket with your cloud provider. If you are curious, this is the list of active companies within the project and in the kubernetes/kubernetes repo in the past two years: https://k8s.devstats.cncf.io/d/8/company-statistics-by-repository-group?orgId=1

For more context, this SIG Node’s current bug backlog: https://github.com/orgs/kubernetes/projects/59/

We review incoming bugs weekly at our CI subproject meeting, but we don’t have the resources to meaningfully address every bug right now. I’d like to thank @matthyx for prioritizing this and all the work he’s been doing trying to get this fixed given everything on our plates!

--retries 999

Thanks @matthyx.

Thanks a lot for all the testing… I will work on my PR to make it mergeable and add an option to set the number of retries like wget:

-t number
--tries=number
    Set number of tries to number. Specify 0 or ‘inf’ for infinite retrying. The default is to retry 20 times.

If anyone is in AWS, check this out: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/network_mtu.html#path_mtu_discovery

You need to ensure inbound ICMP is allowed from your instances to your load balancers. Otherwise, requests to fragment large packets never get back to the ELB, so the packet is never re-sent by the ELB and is lost. This breaks the TCP session and the connection resets.

--retries worked well enough for me too. It’s not pretty, but it works. Seems curiously appropriate for this use case. 😁

OK, here is another repro, with timestamps and logs.

ubuntu@ip-172-31-36-156:~$ for i in {1..100}; do kubectl exec pod1 -- seq 1000000000 | tail; echo ---; date; done
999217034
999217035
999217036
999217037
999217038
999217039
999217040
999217041
999217042
999217043
---
Fri Dec 23 18:57:38 UTC 2022

apiserver logs around that time:

E1223 18:57:38.096330       1 upgradeaware.go:440] Error proxying data from backend to client: unexpected EOF
E1223 18:58:17.850982       1 upgradeaware.go:440] Error proxying data from backend to client: write tcp 192.168.49.2:8443->192.168.49.1:44238: write: connection reset by peer

kubelet logs around that time:

Dec 23 18:57:37 minikube kubelet[1981]: E1223 18:57:37.749856    1981 upgradeaware.go:426] Error proxying data from client to backend: readfrom tcp 127.0.0.1:60112->127.0.0.1:43303: write tcp 127.0.0.1:60112->127.0.0.1:43303: write: connection reset by peer

It seems like when the transfer is >99.9% done and the process running in the container (seq 1000000000, or tar, or whatever) ends, we close the connection even though the data hasn't all been sent to the client yet.

IMO something like this is happening

  1. The process in the container (let’s say tar) is writing to stdout and terminates.
  2. Because of network latency, some of the tar output is still sitting in the TCP buffer and hasn’t made it to the client yet
  3. The kernel tears down and resets the TCP connection before the remaining tar output is sent to the client.

This would explain why I’m consistently seeing connection teardowns when the transfer is 99.9% done and why sleep 10 mitigates the issue.

I doubt this will happen anytime soon, since it would require having rsync in the image… sig-cli required that I use only tar because this has been the only requirement historically.

That said, we could do some tricks with ephemeral containers now that they're enabled by default, and mount an image containing rsync.

Ah, and FYI, --retries despite the name does resume where it failed, so it IS a valid solution for production.

same here.

Trying to download Prometheus snapshots ~= 3.8G

files in the pod:

-rw-r--r--    1 1000     2000        4.2G Oct 17 19:09 prom-backup.tar.gz
-rw-r--r--    1 1000     2000        3.8G Oct 17 21:14 prom-snapshot.tar.gz

Download via kubectl cp:

$ kubectl cp default/prometheus-0:/prometheus/prom-backup.tar.gz prom-backup.tar.gz
Defaulting container name to prometheus.
tar: removing leading '/' from member names
error: unexpected EOF

same for the prom-snapshot.tar.gz file.

I am using:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.1", GitCommit:"d647ddbd755faf07169599a625faf302ffc34458", GitTreeState:"clean", BuildDate:"2019-10-02T23:49:20Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:50Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}

We should test this issue on websockets instead of SPDY, when websockets are available.

I found that the problem came from a network infrastructure issue. kubectl talks to the Kubernetes API over HTTPS for apply, create, exec, and cp. All of these work over HTTPS with the header “X-Stream-Protocol-Version: channel.k8s.io”; the cp command only works if the container in the pod has the tar command. kubectl calls the API to exec into the container and fetches stdout from the tar command in the container, as below.

tar -xmf - -C /path/to/container/file

The stdout is then streamed back to the client. All of this traffic passes over HTTPS, or rather over TLS. Later, I connected to a worker node over SSH and tried copying the file directly from the node with scp, and that did not succeed either: with a 100 MB file it always got stuck at 69 or 70 MB (about 70% of the file). This proves the problem comes from a network infrastructure issue, not from kubectl or even the CNI network.

My network runs on an SDN (VMware NSX) whose edge routers route to external/other networks via OSPF. With multiple edge routers connected to multiple ToR devices, OSPF + ECMP combined with a stateful firewall is the pain point. The stateful firewall in VMware NSX does not work with ECMP routing, because ECMP spreads traffic across all the edge routers while the stateful firewall in NSX only operates on one active device at a time. Once I set the firewall policy to stateless, everything worked as expected: kubectl cp, scp, and any other application running over TLS worked. So please make sure applications running over TLS actually work on your network, and test by copying or transferring files larger than 2 MB, up to a few tens or hundreds of MB.

Here is a workaround that works for me:

kubectl exec pod_name -- bash -c "tar cf - /remote/path; sleep 10" | tar xf - -C /local/path

The sleep 10 step is intended to prevent premature teardown, which seemed to be the problem in my case.

It’s clearly a hacky workaround to mitigate a race condition in the teardown. But it helps me copy the file over, so I’ll take it.

Edit: Simpler Repro

Here is a simpler repro of the issue. As tail shows, the output terminated a little prematurely:

ubuntu@ip-172-31-47-74:~$ kubectl exec $POD -- seq 1000000000 | tail
999619262
999619263
999619264
999619265
999619266
999619267
999619268
999619269
999619270
99ubuntu@ip-172-31-47-74:~$

It doesn’t repro every time, but frequently enough. I’m accessing a Kubernetes cluster across regions, so that could be a contributing cause, resulting in a timing behavior that triggers the race condition.

When I add a sleep, I don’t see the issue:

ubuntu@ip-172-31-47-74:~$ kubectl exec $POD -- bash -c "seq 1000000000; sleep 1" | tail
999999991
999999992
999999993
999999994
999999995
999999996
999999997
999999998
999999999
1000000000

In disaster recovery scenarios --retries=10 isn’t going to cut it.

Here’s how you can use rsync to workaround this issue: https://vhs.codeberg.page/post/recover-files-kubernetes-persistent-volume/

https://user-images.githubusercontent.com/97140109/156882836-1cc82ff0-0a6e-4458-b30d-410801f33c83.mp4

@matthyx --retries=10 worked for me as it was resuming where it failed.

In my case the file is only 17MB. Can’t do it.

And I’m not working for Google (maybe one day dear recruiters?) and do this for fun during my free time.

Not sure yet, give me a few more hours of wasting my time on a 4-year-old bug.

A few hours have passed. The approach I settled on uses rsync and a script shared by @karlbunch on ServerFault. It’s robust, resumable and provides progress indication. I’m using it like so:

./krsync -av --progress --stats pod@namespace:/src-dir ./dest-dir

To get at the data in my PersistentVolume I create a Pod with the PV attached as described here. In addition to downloading large files, I can reliably download many small files as well. When the script fails due to network conditions I simply rerun the script and it picks up where it left off. This approach can be used for downloading as well as uploading.
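
For reference, the krsync approach is essentially a tiny wrapper that makes rsync use kubectl exec as its remote shell; a sketch along the lines of the ServerFault script (it requires rsync to be installed both locally and inside the container) looks like this:

#!/bin/bash
# krsync: run rsync over "kubectl exec" instead of ssh.
# Usage: krsync -av --progress --stats pod@namespace:/src-dir ./dest-dir
if [ -z "$KRSYNC_STARTED" ]; then
    export KRSYNC_STARTED=true
    # First invocation: re-exec rsync and tell it to use this script as its remote shell.
    exec rsync --blocking-io --rsh "$0" "$@"
fi

# Second invocation: we are now acting as the remote shell that rsync spawned.
namespace=''
pod=$1
shift

# rsync rewrites "pod@namespace" as "-l pod namespace", so unpack that form.
if [ "X$pod" = "X-l" ]; then
    pod=$1
    shift
    namespace="-n $1"
    shift
fi

# $namespace is deliberately unquoted so it disappears when empty.
exec kubectl $namespace exec -i "$pod" -- "$@"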

OK, I think I've figured out why this differs between clients and why @dlipofsky 's gzipping method helped him (it didn't help me…). What is going on is that if the download speed is too fast, some kind of error occurs. I don't know why or what kind of error it is, but it is closely related to download speed.

Catting large files works in the terminal because that is slow. @dlipofsky 's method helps because gzip can also slow things down on slower machines.

For me, I had to use pv to do rate limiting.

kubectl exec -i dpnk-675b9ff794-h9s94 -- gzip -c /home/aplikace/kb.zip | pv -L 100K -q | gzip -cd > ~/pr/auto-mat/gis/firmy/kb/2021.zip

works, but

kubectl exec -i dpnk-675b9ff794-h9s94 -- gzip -c /home/aplikace/kb.zip | gzip -cd > ~/pr/auto-mat/gis/firmy/kb/2021.zip

gives EOF.

Don't know if it helps, but we have a similar problem here: my colleague runs into the EOF, I'm not. He is running kubectl on a Mac, I'm on Linux; wondering if that could be related? The file is not modified during the transfer, and the EOF happens after 67.9 MB of a 74 MB file was transferred.

my version:

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-25T14:58:59Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.20-gke.901", GitCommit:"db8463aecdfe0eb5d9067effa2fd5ab3ff7a988e", GitTreeState:"clean", BuildDate:"2021-08-04T23:27:42Z", GoVersion:"go1.13.15b4", Compiler:"gc", Platform:"linux/amd64"}

his version where the EOF happens:

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.4", GitCommit:"3cce4a82b44f032d0cd1a1790e6d2f5a55d20aae", GitTreeState:"clean", BuildDate:"2021-08-13T15:45:10Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.20-gke.901", GitCommit:"db8463aecdfe0eb5d9067effa2fd5ab3ff7a988e", GitTreeState:"clean", BuildDate:"2021-08-04T23:27:42Z", GoVersion:"go1.13.15b4", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.21) and server (1.18) exceeds the supported minor version skew of +/-1

uh…that warning is already interesting

@leberknecht Looks like there's a regression with kubectl cp on kubectl 1.21+; we found that downgrading the kubectl client to 1.19 worked in those cases (I haven't tested whether 1.20 has any problem).

Any update on this one?

Facing the same issue when copying large files > 600MB. Basically the following errors:

  • unexpected EOF -> kubectl cp and exec output
  • Error proxying data from client to backend: unexpected EOF -> API server
  • error forwarding port <port> to pod <pod_id>, uid : EOF -> kubelet

This is happening from within the cluster as well.

Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.8", GitCommit:"0c6d31a99f81476dfc9871ba3cf3f597bec29b58", GitTreeState:"clean", BuildDate:"2019-07-08T08:38:54Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Any ideas?

Same for us. Clusters are generated using kops, ICMP is allowed, and the EOFs are still here.

If you have an older kubectl version (< 1.23), the --retries option is of course not available:

> Error: unknown flag: --retries

But if you look at the kubectl options, you'll find the request timeout option as well. So if you set it to a higher value, like kubectl cp infra-cassandra-global-0:cassandra.tar.gz infra-cassandra-global-0-cassandra.tar.gz --request-timeout=3m, you could also mitigate the EOF error!

Source

--retries 999 💚 Thanks @matthyx.

I don't have anything to add to the analysis here. Kubernetes needs to migrate off of SPDY, the sooner the better. I tried to make this happen, but it looks like it was/is not a priority for the maintainers. https://github.com/kubernetes/kubernetes/issues/89163, https://github.com/kubernetes/enhancements/pull/3401, https://github.com/kubernetes/kubernetes/pull/110142 expired.

Edit: this was opened in 2015 (!) and no progress has been made.

thanks @vhscom this solution solved my problem

@matthyx --retries=10 worked for me as it was resuming where it failed.

my two cents: use --retries=-1 for infinite retrying

It’s available in 1.24 kubectl official binary.

@matthyx first of all, thanks for the PR. Do you want me to try with the new code? What would be the point? The new code with the retries probably works, but my point is that the retries shouldn't be necessary.

Yeah, looks like you don’t have connection issues… I wonder if that’s the case for @shanmukha511 @Spareo @smb-h and @akoeb-dkb

But thanks to your tests @bespanto, we realized that if the file gets modified in the middle of the transfer, it confuses tar, which is consistent with @aojea's comment.

Hi folks, I need some testers here… could you try to build kubectl from my branch and try some transfers? I would like to validate the approach before doing a proper PR, thanks!

For those who cannot build it, here is a downloadable binary for linux amd64 (30 days valid): https://www.swisstransfer.com/d/1dfcbf1e-098b-4a85-9f6b-ec7142211019

same here, tar archive with 225MB.

The “workaround” was a bit more reliable for me, after I added a “sleep” to the shell command:

kubectl exec -i [pod name] -c [container name] -- bash -c "cat [path to file] && sleep 1" > [output file]

I investigated this issue a little bit, but unfortunately I didn’t find the fix. Maybe this info could help someone else.

When the issue happens, Kubelet logs:

Error proxying data from client to backend: read from tcp 127.0.0.1:38582->127.0.0.1:45563: write tcp 127.0.0.1:38582->127.0.0.1:45563: write: broken pipe

(port 45563 is the kubelet port)

And kube-apiserver logs:

Error proxying data from backend to client: read tcp 172.25.16.204:55262->172.25.16.208:10250: read: connection reset by peer

(172.25.16.204 is the master node where kube-apiserver is running, 172.25.16.208 is the worker node where the container with the file is running)

To help replicate the issue, using pv to limit download speeds helped me: kubectl exec -i ... -- cat ... | pv -L 1m > copied_file.txt

Hi, I found a nice solution called "devspace" (https://github.com/devspace-cloud/devspace). Using this tool you can run, for example:

devspace sync --pod=<your pod name> --container-path=<path> --download-only

For me it worked great!

Same here

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.0", GitCommit:"0ed33881dc4355495f623c6f22e7dd0b7632b7c0", GitTreeState:"clean", BuildDate:"2018-09-28T15:20:58Z", GoVersion:"go1.11", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:05:37Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

$ kubectl cp somepod-f667d484-wkc6s:isolate-0x5650204fe520-v8.log .
error: unexpected EOF

filesize

-rw-r--r--    1 root     root     602286098 Oct 16 05:51 isolate-0x5650204fe520-v8.log

/remove-lifecycle rotten

But it may be caused by my mac env

/sig node /sig network /sig cli

Are you (or someone else) interested in debugging the issue? I can try to construct a repro on publicly available infrastructure. Egress cost is one challenge, but I can see how to set it up. (I guess GCP would be preferable to AWS.)

I think it would be better to have something isolated on a laptop… maybe with tc to add latency between kubectl and the kind cluster?
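
For instance, on a Linux host, tc with the netem qdisc can inject artificial latency on the interface that carries traffic to the kind cluster (the interface name here is an assumption and varies by setup):

# Add 200 ms of delay on the bridge interface used by the kind nodes
# (find the right interface with `ip link` / `docker network inspect kind`).
sudo tc qdisc add dev br-<kind-network-id> root netem delay 200ms

# Remove the delay again after the experiment.
sudo tc qdisc del dev br-<kind-network-id> root netem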

sleep 10 works for me, thanks.

I'm having the same issue. Both kubectl exec -- cat and kubectl cp fail to copy the entire file.

While debugging, I found this:

❯ kubectl exec mypod -v9 -- seq 1000000 | tail
...
I0923 20:06:31.361134   22478 round_trippers.go:570] HTTP Statistics: DNSLookup 0 ms Dial 0 ms TLSHandshake 0 ms Duration 352 ms
I0923 20:06:31.361198   22478 round_trippers.go:577] Response Headers:
I0923 20:06:31.361219   22478 round_trippers.go:580]     Connection: Upgrade
I0923 20:06:31.361235   22478 round_trippers.go:580]     Upgrade: SPDY/3.1
I0923 20:06:31.361250   22478 round_trippers.go:580]     X-Stream-Protocol-Version: v4.channel.k8s.io
I0923 20:06:31.361265   22478 round_trippers.go:580]     Date: Fri, 23 Sep 2022 18:06:30 GMT
I0923 20:06:37.454012   22478 connection.go:198] SPDY Ping failed: connection closed
753729
753730
753731
753732
753733
753734
753735
753736
753737
75

❯

It seems the connection is dropped because a SPDY ping failed. Hope this helps someone to identify the problem.

I haven't done too much testing, but got pretty consistent results over a few retries (file size is 273557558 bytes):

kubectl exec POD_ID -- cat core.33.gz > aaa ; md5sum aaa

returns a random checksum pretty much every time it's run, whereas:

kubectl exec POD_ID -- bash -c 'cat core.33.gz && sleep 20' > aaa ; md5 aaa

consistently returns correct checksum (20s is just a random pick).

kubectl cp is calling tar and it seems to be the same issue for both. Perhaps buffers are not flushed before the process exits?

Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.0", GitCommit:"4ce5a8954017644c5420bae81d72b09b735c21f0", GitTreeState:"clean", BuildDate:"2022-05-03T13:46:05Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.15", GitCommit:"8f1e5bf0b9729a899b8df86249b56e2c74aebc55", GitTreeState:"clean", BuildDate:"2022-01-19T17:23:01Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}

In my case I was able to copy a file sized over 1 GB with no problem, while on a VPN, by using tar and compressing and decompressing the file contents:

kubectl exec -n <namespace> <pod-name> -- tar czf - <source-file-path> | tar xzf -

In my case it was VPN configuration that was causing the problem. I turned off VPN and no error, just works!

https://github.com/kubernetes/kubernetes/issues/60140#issuecomment-1049872517

Please also tell me what happens when you use the --retries=10 option of kubectl cp… does it always resume at the same number of bytes?

Thanks, this works for me.

Reporting the same problem here: macOS Monterey, M1, kubectl v1.23.2.

Both using OpenVPN and without a VPN, trying to cp an SQL file of ~200 MB always failed at the end. I had to set --retries to non-zero to work around this, like the above suggestion.

without --retries

tar: Removing leading `/' from member names
Dropping out copy after 0 retries
error: unexpected EOF

with --retries=50

tar: Removing leading `/' from member names
Resuming copy at 211193856 bytes, retry 0/50
tar: Removing leading `/' from member names

Then a smaller one failed repeatedly at the same location

can you share that file? Having a reproducer simplifies everything

Please re-open. As per my last comment, this PR does NOT fix the underlying problem. This is NOT caused by network errors, but rather some sort of underlying bug in k8s.

Hmm, technically ATM I don't see how to do it without relying on something other than tar and tail being present inside the container image… The problem is that AFAIK tar cannot resume on a partial archive, so the only idea I see would be to transfer the whole archive (with retries) and then untar it, doubling the space requirements for the transfer.

@matthyx - What are the steps to download the CLI client and use it to test the changes that you made to fix the download issue?

If you don't fear running a binary built on my PC, you can check several comments up to see download links for Linux amd64 and macOS. Otherwise you can check out my branch and run make kubectl.

Hi @matthyx

I believe it's not about the size of the file, because we copy hundreds of GB of data from one pod to another, and even in a few cases where we copy less than 500 MB we face the error. The issue is intermittent; it does not occur every time.

From my initial testing, it seems kubectl cp works well as long as the network connectivity is stable. I was able to transfer several GB files with no issue. But looking at the implementation, it’s clear there is no retry mechanism.

Maybe I could try to hack something like:

  • usual tar cf - ... in the pod
  • if we detect something wrong, we try to estimate where we left off (looking at the cumulative Header.Size and how much we wrote in the current file)
  • we launch another tar cf - ... but piped into a tail -c +$(($SIZE+1)) and we continue (see the sketch below)
  • repeat until the end

Idea taken from stackexchange.

What do you think?
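
A rough client-side sketch of that idea, assuming GNU tar and tail in the container image, GNU stat locally, and that the archived files do not change between attempts (all names are illustrative):

POD=my-pod
SRC=/data            # directory to archive inside the container
OUT=archive.tar

: > "$OUT"
while true; do
  offset=$(stat -c%s "$OUT")   # archive bytes we already have locally
  # Re-create the same archive remotely, skip the bytes we already received,
  # and append the rest; kubectl exec exits non-zero if the stream is interrupted.
  if kubectl exec "$POD" -- sh -c "tar cf - $SRC | tail -c +$((offset + 1))" >> "$OUT"; then
    break
  fi
  echo "transfer interrupted at $offset bytes, retrying..."
done
tar xf "$OUT"

The --retries flag discussed earlier exposes this kind of byte-offset resumption from the kubectl side.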

This problem made me sick recently, since I needed to move data for some reason, and none of the above workarounds worked 100%; not even a 20% chance of success, from what I experienced. So finally I found this service (https://transfer.sh/) where you can upload a file via a curl command and download it in the same way. Easy to use and, yeah, problem solved for me! 😃 Of course there are some security concerns you might have, but compared to the time spent on this issue, in my case I was more than happy.

After some further testing I think I have to take back the “workaround” from my comment above. Currently it seems to have something to do with the length of the target file name (or the whole path) and the file size. But I did not have any time to follow up this lead. As you also mentioned, I normally try some different things (like renaming, …) until it works. Will post again if I find the time to test and get better “results”.

I have this issue as well. Weirdly, it happens when my target file name contains an underscore! If I remove the underscore (which in my case makes the filename contain only letters and a dot), then the file is copied just fine! I have not yet checked whether other non-letter characters also produce errors.

Same here. It was working before. Getting this error on a 3MB file.