kubernetes: kubectl cp fails on large files
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
Copying either a large file or a large directory from a container via `kubectl cp` results in the error `error: unexpected EOF` and a failure to transfer. In my case, the file is 1.7G.
I executed the following command:

```
kubectl cp infra-cassandra-global-0:cassandra.tar.gz infra-cassandra-global-0-cassandra.tar.gz
```

The command executes, but the terminal prints the error below after 10-14 seconds of execution, and no file is copied.

```
error: unexpected EOF
```
What you expected to happen:
The large file to be downloaded from the container.
How to reproduce it (as minimally and precisely as possible):
Add a large file >= 1.7G to any location in the pod. I was able to re-create this with the file on a PV, or locally on the image file system. Execute `kubectl cp` to download the large file. It will fail.
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): v1.8.6, client v1.8.5
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release): k8s-1.8-debian-jessie-amd64-hvm-ebs-2017-12-02 (ami-06a57e7e)
- Kernel (e.g. `uname -a`):
- Install tools: Kops
- Others:
About this issue
- Original URL
- State: open
- Created 6 years ago
- Reactions: 124
- Comments: 203 (71 by maintainers)
why not
@matthyx This works wonders! It appears `--retries` does not completely retry the file copy but will actually resume the file transfer when the EOF error occurs. Setting `--retries` very high should work for most cases (and worked for me!).

Before: the file size on the local machine was not the same as in the container, indicating the transfer dropped.

After: `ls -l` confirms the file size is the same.

Thank you!
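For anyone landing here, a minimal sketch of that invocation, reusing the pod and file names from the original report (the retry count is an arbitrary choice):

```
# Resume-on-EOF copy; each retry continues from the bytes already received
kubectl cp infra-cassandra-global-0:cassandra.tar.gz cassandra.tar.gz --retries=100
```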
Crazy that after 5 years this is still an issue…
This problem could be avoided if `kubectl cp` implemented resumable downloads to work around temporary network errors.
I tried to copy a single large file using the kubectl (version 1.21) cp command. It failed with an EOF error. However, when I tried to cp the entire folder containing the large file, it worked successfully. All the files, including the large one, were copied without any error.

```
kubectl cp <pod_id>:/home/test_folder test_folder_localhost
```
I ended up splitting the file into 500 MB chunks, copied each chunk over one at a time, and it worked fine.

```
split ./largeFile.bin -b 500m part.
```

Copy all of the parts:

```
kubectl cp <pod>:<path_to_part.aa> part.aa
```

Then reassemble with cat:

```
cat part* > largeFile.bin
```

I suggest you use a checksum to validate the file's integrity once you are done.
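A minimal sketch of that validation step (paths are hypothetical; assumes `sha256sum` exists in the container image):

```
# Checksum of the original file inside the pod
kubectl exec <pod> -- sha256sum <path_to_largeFile.bin>
# Checksum of the reassembled local copy; the two hashes should match
sha256sum largeFile.bin
```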
For “file transfer” use cases, I have added to `kubectl` the (unfortunately named) `--retries` flag that will recover from interruptions and continue the transfer where it failed.

To work around the issue, I ended up running “cat” on the file and redirecting the output to the location I wanted to copy the file to. Ex:

```
kubectl exec -i [pod name] -c [container name] -- cat [path to file] > [output file]
```
I found the issue here in our environment. We had blocked ICMP packets in the SG attached to the ENI of our API ELB (CLB these days). This meant that requests to fragment large packets were not getting back to the ELB. Because of that, the packet was never re-sent by the ELB, which meant it was lost. This breaks the TCP session and the connection resets.
In short, make sure that ICMP is allowed between your Load Balancer and your hosts, and your MTU settings are correctly calibrated.
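As an illustration only (the group ID and CIDR here are made up; scope the rule to your own VPC), allowing ICMP with the AWS CLI might look like:

```
# Permit ICMP from the load balancer's network so Path MTU Discovery
# "fragmentation needed" messages reach the backends
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol icmp --port -1 \
  --cidr 10.0.0.0/16
```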
Please also tell me what happens when you use the `--retries=10` option of `kubectl cp`… does it always resume at the same number of bytes?

Hello, we understand that you’re frustrated, but we don’t speak to each other like that here.
There are 1.7k issues in this repository and not as many maintainers as you think.
We can’t help you unless the issue is nailed down enough that we can reproduce it.
on behalf of the CoCC
`--retries` works like a charm. Thanks @matthyx
@davesuketu215 here it is: https://github.com/kubernetes/kubernetes/pull/104792 This only affects the CLI client and doesn’t require any node or api-server change… which means as soon as you can download an official binary build (not sure if we have nightly ones) it will work in your environment if you use the new `kubectl`.

I also have the EOF problem (intermittently) running v1.21 on macOS. I did not try the downgrade suggested above, but I did learn that `kubectl cp` is really just a thin wrapper on `kubectl exec` plus `tar`. I rolled my own, getting rid of tar and adding compression, and it works much better:

```
kubectl exec -n $ns $pod -c $container -- gzip -c trace.log | gzip -cd > trace.log
```

Or don’t even bother to decompress at the end and do:

```
kubectl exec -n $ns $pod -c $container -- gzip -c trace.log > trace.log.gz
```

I just did a 54 GB file in about 12 minutes with no errors.

/triage accepted
/assign @matthyx
Let’s try to reproduce
I just ran into the same issue. Interestingly, a larger file worked while a smaller file repeatedly failed. I don’t think it has anything to do with the file size, but rather with the fact that the command/protocol involved cannot properly handle binary data. What worked for me is to base64 encode the data on the fly (a sketch follows below), using:
- `-c perf-sidecar` to target the container in the pod (you can leave this out if you only have one)
- `deploy/mqtt-endpoint` for a pod of the deployment (you can also directly use the pod name)
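The exact command was not captured in the thread; a hedged reconstruction from the flags above (the file path is hypothetical, and note that BSD base64 uses `-D` instead of `-d`):

```
# Base64-encode in the container and decode locally, so only ASCII
# travels over the exec stream
kubectl exec deploy/mqtt-endpoint -c perf-sidecar -- base64 /path/to/data.bin | base64 -d > data.bin
```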
Thanks for the report… I’m super interested in transfers that fail consistently and retry at the exact same place, which would mean we have a special kill sequence in the flow (which I doubt until I see one).
Looks like this is still an issue with 1.20
I was also facing this same issue. My file was 10.3 GB. My workaround was compressing the file using xz (down to 13% of the original size), then splitting it into 10 MB chunks. Then I wrote a script on my end to fetch the individual chunks, and finally joined them back together using cat. Below are the commands that I used:
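The commands themselves did not survive the thread capture; a hedged reconstruction of the same idea (pod name and paths are hypothetical):

```
# In the pod: compress, then split into 10 MB chunks
kubectl exec <pod> -- sh -c 'xz -k /data/dump.bin && split -b 10m /data/dump.bin.xz /data/part.'
# Locally: fetch each chunk, then reassemble and decompress
for p in $(kubectl exec <pod> -- sh -c 'ls /data/part.*'); do
  kubectl cp "<pod>:$p" "$(basename "$p")"
done
cat part.* > dump.bin.xz
xz -d dump.bin.xz
```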
I’d suggest that you please raise a support ticket with your cloud provider. If you are curious, this is the list of active companies within the project and in the kubernetes/kubernetes repo in the past two years: https://k8s.devstats.cncf.io/d/8/company-statistics-by-repository-group?orgId=1
For more context, here is SIG Node’s current bug backlog: https://github.com/orgs/kubernetes/projects/59/
We review incoming bugs weekly at our CI subproject meeting, but we don’t have the resources to meaningfully address every bug right now. I’d like to thank @matthyx for prioritizing this and all the work he’s been doing trying to get this fixed given everything on our plates!
Thanks @matthyx.
Thanks a lot for all the testing… I will work on my PR to make it mergeable and add an option to set the number of retries like wget:
If anyone is in AWS, check this out: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/network_mtu.html#path_mtu_discovery
You need to ensure you allow inbound ICMP from your load balancers to your instances; otherwise, requests to fragment large packets never get back to the ELB, the packet is never re-sent, and the loss breaks the TCP session so the connection resets.
`--retries` worked well enough for me too. It’s not pretty, but it works. Seems curiously appropriate for this use case. 😁

OK, here is another repro, with timestamps and logs.
apiserver logs around that time:
kubelet logs around that time:
It seems like when the transfer is >99.9% done and the process running in the container (`seq 1000000000` or `tar` or whatever) ends, we close the connection even though the data hasn’t been sent to the client yet.

IMO something like this is happening:
1. The process in the container (e.g. `tar`) is writing to stdout and terminates.
2. Some of the `tar` output is still sitting in the TCP buffer and hasn’t made it to the client yet.
3. The connection is torn down before the remaining `tar` output is sent to the client.

This would explain why I’m consistently seeing connection teardowns when the transfer is 99.9% done and why `sleep 10` mitigates the issue.

I doubt this will happen anytime soon, since it would require having `rsync` in the image… sig-cli required that I use only `tar` because this has been the only requirement historically.

That said, we could do some tricks with ephemeral containers now that it’s enabled by default, and mount an image containing `rsync`.

Ah, and FYI, `--retries` despite the name does resume where it failed, so it IS a valid solution for production.

Same here.
Trying to download Prometheus snapshots, ~3.8G.

Files in the pod:

Download via `kubectl cp`:

Same for the `prom-snapshot.tar.gz` file.

I am using:
We should test this issue on WebSockets instead of SPDY, once WebSockets are available.
I found that the problem comes from a network infrastructure issue. kubectl talks to the Kubernetes API over HTTPS, whether for ‘apply’, ‘create’, ‘exec’, or ‘cp’. All of them work over HTTPS with the header “X-Stream-Protocol-Version: channel.k8s.io”; the cp command only works if the container in the pod has the tar command. kubectl calls the API to exec into the container and streams the file through tar, as below:

```
tar -xmf - -C /path/to/container/file
```

And stdout is streamed back to the client. All traffic passes over HTTPS, or rather over TLS. Later, I tried SSH to a worker node and ran `scp` to copy a file directly from the node, and that did not succeed either: with a 100 MB file, it always got stuck at 69 or 70 MB (70% of the file). This proves the problem comes from the network infrastructure, not from kubectl or even the CNI network.

My network runs under an SDN (VMware NSX) whose Edge Routers route to external/other networks via OSPF. With multiple Edge Routers to multiple ToR devices, OSPF + ECMP, and a stateful firewall; that combination is the pain point. The stateful firewall in VMware NSX does not work with ECMP routing, because ECMP spreads traffic across all Edge Routers while the stateful firewall in NSX only operates on one active device at a time. I set the firewall policy to stateless and everything worked as expected: `kubectl cp`, `scp`, and any other application over TLS works. So please verify that your applications running over TLS work; test copying or transferring files larger than 2 MB, up to a few tens or hundreds of MB.

Here is a workaround that works for me:
The `sleep 10` step is intended to prevent premature teardown, which seemed to be the problem in my case.

It’s clearly a hacky workaround to mitigate a race condition in the teardown. But it helps me copy the file over, so I’ll take it.
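The workaround command itself was lost in the thread capture; a hedged reconstruction based on the description (pod name and paths are hypothetical):

```
# Stream the file, then sleep so the connection is not torn down
# before the TCP buffers drain
kubectl exec -i <pod> -- sh -c 'cat /path/to/file && sleep 10' > file
```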
Edit: Simpler Repro
Here is a simpler repro of the issue. As `tail` shows, the output terminated a little prematurely:

It doesn’t repro every time, but frequently enough. I’m accessing a Kubernetes cluster across regions, so that could be a contributing cause, resulting in timing behavior that triggers the race condition.
When I add a sleep, I don’t see the issue:
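The repro commands were elided above; given the `seq 1000000000` example earlier in the thread, a hedged reconstruction of the two variants:

```
# Without a sleep the last lines are often missing
kubectl exec <pod> -- seq 1000000000 | tail -n 1
# With a trailing sleep the full output reliably arrives
kubectl exec <pod> -- sh -c 'seq 1000000000; sleep 10' | tail -n 1
```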
In disaster recovery scenarios `--retries=10` isn’t going to cut it.

Here’s how you can use `rsync` to work around this issue: https://vhs.codeberg.page/post/recover-files-kubernetes-persistent-volume/

https://user-images.githubusercontent.com/97140109/156882836-1cc82ff0-0a6e-4458-b30d-410801f33c83.mp4
@matthyx `--retries=10` worked for me as it was resuming where it failed.

In my case the file is only 17 MB. Can’t do it.
And I’m not working for Google (maybe one day dear recruiters?) and do this for fun during my free time.
A few hours have passed. The approach I settled on uses `rsync` and a script shared by @karlbunch on ServerFault. It’s robust, resumable, and provides progress indication. I’m using it like so:

To get at the data in my `PersistentVolume` I create a Pod with the PV attached, as described here. In addition to downloading large files, I can reliably download many small files as well. When the script fails due to network conditions I simply rerun it and it picks up where it left off. This approach can be used for downloading as well as uploading.
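For reference, a condensed sketch of that ServerFault pattern (not the exact script; it assumes `rsync` is installed in the container image):

```
#!/bin/sh
# krsync: run rsync with kubectl exec as the transport
if [ -z "$KRSYNC_STARTED" ]; then
    export KRSYNC_STARTED=true
    exec rsync --blocking-io --rsh "$0" "$@"
fi
# Second invocation: rsync calls this script as its "remote shell"
pod=$1; shift
exec kubectl exec -i "$pod" -- "$@"
```

Usage would then be something like `./krsync -av --progress --partial <pod>:/data/ ./data/`.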
OK, I think I’ve figured out why this differs based on the client and why @dlipofsky’s gzipping method helped him (it didn’t help me…). What is going on is that if the download speed is too fast, some kind of error occurs. I don’t know why or what kind of error it is, but it is closely related to download speed.
Catting large files works in the terminal because that is slow. @dlipofsky’s method helps because gzip can also slow things down on slower machines.

For me, I had to use pv to do rate limiting:

```
kubectl exec -i dpnk-675b9ff794-h9s94 -- gzip -c /home/aplikace/kb.zip | pv -L 100K -q | gzip -cd > ~/pr/auto-mat/gis/firmy/kb/2021.zip
```

works, but

```
kubectl exec -i dpnk-675b9ff794-h9s94 -- gzip -c /home/aplikace/kb.zip | gzip -cd > ~/pr/auto-mat/gis/firmy/kb/2021.zip
```

gives EOF.

@leberknecht Looks like there’s a regression with kubectl cp on kubectl 1.21+; we found that downgrading the kubectl client to 1.19 worked in those cases (I haven’t tested whether 1.20 has the problem).
Any update on this one?
Facing the same issue when copying large files > 600 MB. Basically the following errors:
- `unexpected EOF` -> copy and exec outputs
- `Error proxying data from client to backend: unexpected EOF` -> API server
- `error forwarding port <port> to pod <pod_id>, uid : EOF:` -> kubelet

This is happening from within the cluster as well.
Any ideas?
Same for us. Clusters are generated using kops, ICMP is allowed, and the EOFs are still here.
If you have an older kubectl version (< 1.23), the retries option is of course not available:

But if you look at the kubectl options, you’ll find the request timeout option as well. So if you set that to a higher value, like

```
kubectl cp infra-cassandra-global-0:cassandra.tar.gz infra-cassandra-global-0-cassandra.tar.gz --request-timeout=3m
```

you can also mitigate the EOF error! Source
`--retries 999` 💚 Thanks @matthyx.

I don’t have anything to add to the analysis here. Kubernetes needs to migrate off of SPDY, the sooner the better. I tried to make this happen, but it looks like it was/is not a priority for the maintainers: https://github.com/kubernetes/kubernetes/issues/89163, https://github.com/kubernetes/enhancements/pull/3401; https://github.com/kubernetes/kubernetes/pull/110142 expired.
Edit: this was opened in 2015 (!) and no progress has been made.
thanks @vhscom this solution solved my problem
my two cents: use `--retries=-1` for infinite retrying

It’s available in the 1.24 kubectl official binary.
@matthyx first of all, thanks for the PR. Do you want me to try with the new code? What would be the point? The new code with the retries probably works, but my point is, the retries shouldn’t be necessary.
Yeah, looks like you don’t have connection issues… I wonder if that’s the case for @shanmukha511 @Spareo @smb-h and @akoeb-dkb
But thanks to your tests @bespanto, we realized that if the file gets modified in the middle of the transfer, it confuses `tar`, which is consistent with @aojea’s comment.

For those who cannot build it, here is a downloadable binary for Linux amd64 (valid for 30 days): https://www.swisstransfer.com/d/1dfcbf1e-098b-4a85-9f6b-ec7142211019
same here, with a 225 MB tar archive.
The “workaround” was a bit more reliable for me, after I added a “sleep” to the shell command:
I investigated this issue a little bit, but unfortunately I didn’t find the fix. Maybe this info could help someone else.
When the issue happens, Kubelet logs:
(port 45563 is the kubelet port)
And kube-apiserver logs:
(172.25.16.204 is the master node where kube-apiserver is running, 172.25.16.208 is the worker node where the container with the file is running)
To help replicate the issue, using `pv` to limit download speed helped me:

```
kubectl exec -i ... -- cat ... | pv -L 1m > copied_file.txt
```
Hi, I found a nice solution called “devspace” (https://github.com/devspace-cloud/devspace). Using this tool you can run, for example, `devspace sync --pod=<pod name> --container-path=<path> --download-only`. For me it worked great!
Same here
/remove-lifecycle rotten
But it may be caused by my mac env
/sig node /sig network /sig cli
I think it would be better to have something isolated on a laptop… maybe with `tc` to add latency between `kubectl` and the `kind` cluster? Something like the sketch below, perhaps.
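A hedged sketch of that idea (the interface name and delay value are assumptions; requires root):

```
# Add 200 ms of latency on the interface carrying kind traffic,
# making the teardown race easier to reproduce
sudo tc qdisc add dev eth0 root netem delay 200ms
# ...run the kubectl cp repro...
sudo tc qdisc del dev eth0 root netem
```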
`sleep 10` works for me, thanks.

I’m having the same issue. Both `kubectl exec -- cat` and `kubectl cp` fail to copy the entire file.

While debugging, I found this:
It seems the connection is dropped because a SPDY ping failed. Hope this helps someone to identify the problem.
I haven’t done too much testing, but got pretty consistent results over a few retries (file size is 273557558 bytes). This:

returns a random checksum pretty much every time it’s run, whereas this:

consistently returns the correct checksum (20s is just a random pick).

`kubectl cp` calls tar, and it seems to be the same issue for both. Perhaps buffers are not flushed before the process exits?

```
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.0", GitCommit:"4ce5a8954017644c5420bae81d72b09b735c21f0", GitTreeState:"clean", BuildDate:"2022-05-03T13:46:05Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.15", GitCommit:"8f1e5bf0b9729a899b8df86249b56e2c74aebc55", GitTreeState:"clean", BuildDate:"2022-01-19T17:23:01Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
```
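The two commands were elided above; a hedged reconstruction of what such a test could look like (paths are hypothetical):

```
# Often yields a different (truncated) checksum on each run
kubectl exec <pod> -- cat /data/file.bin | sha256sum
# A trailing sleep lets the buffers drain; the checksum is stable
kubectl exec <pod> -- sh -c 'cat /data/file.bin; sleep 20' | sha256sum
```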
In my case I was able to copy a file sized over 1 GB with no problem, using tar and compressing and decompressing the file contents, while on a VPN:

```
kubectl exec -n <namespace> <pod-name> -- tar czf - <source-file-path> | tar xzf -
```
In my case it was VPN configuration that was causing the problem. I turned off VPN and no error, just works!
https://github.com/kubernetes/kubernetes/issues/60140#issuecomment-1049872517
Thanks, this works for me.
Reporting the same problem here: macOS Monterey, M1, kubectl v1.23.2.
Both using OpenVPN and without a VPN, trying to cp an SQL file of ~200 MB always failed at the end. I had to set `--retries` to non-zero to work around this, like the suggestion above.
without `--retries`:

with `--retries=50`:
can you share that file? Having a reproducer simplifies everything
Please re-open. As per my last comment, this PR does NOT fix the underlying problem. This is NOT caused by network errors, but rather some sort of underlying bug in k8s.
Hmm, technically ATM I don’t see how to do it without relying on something besides `tar` and `tail` being present inside the container image… The problem being that AFAIK `tar` cannot resume from a partial archive, so the only idea I see would be to transfer the whole archive (with retries) and then untar it, doubling the space requirements for the transfer.

If you don’t fear running a binary built on my PC, you can check several comments up to see download links for Linux amd64 and macOS. Otherwise you can check out my branch and run `make kubectl`.

Don’t know if it helps, but we have a similar problem here: my colleague runs into the EOF, I do not. He is running `kubectl` on a Mac, I’m on Linux; wondering if that could be related? The file is not modified during the transfer; the EOF happens after 67.9 MB of a 74 MB file has been transferred.

my version:
his version where the EOF happens:
uh…that warning is already interesting
Hi @matthyx
I believe it’s not about the size of the file, because we copy hundreds of GB of data from one pod to another. Even in the few cases where we copy less than 500 MB, we face the error in both cases. And the issue is intermittent; it does not occur every time.
From my initial testing, it seems `kubectl cp` works well as long as the network connectivity is stable. I was able to transfer several-GB files with no issue. But looking at the implementation, it’s clear there is no retry mechanism.

Maybe I could try to hack something like:
- run `tar cf - ...` in the pod
- keep track of `Header.Size` and how much we wrote into the current file
- on failure, rerun `tar cf - ...` but piped into a `tail -c +$(($SIZE+1))`, and we continue (a rough shell illustration follows below)

Idea taken from stackexchange.

What do you think?
This problem made me sick recently, since I needed to move data for some reason, and none of the above workarounds worked 100%; not even a 20% chance of success in my experience. So finally I found this service (https://transfer.sh/) where you can upload a file via a curl command and download it the same way. Easy to use, and yeah, problem solved for me! 😃 Of course there are some security concerns you might have, but in my case, compared to the time spent on this issue, I was more than happy.
After some further testing I think I have to take back the “workaround” from my comment above. Currently it seems to have something to do with the length of the target file name (or the whole path) and the file size, but I did not have any time to follow up on this lead. As you also mentioned, I normally try a few different things (like renaming, …) until it works. I will post again if I find the time to test and get better “results”.
I have this issue as well. Weirdly, it happens when my target file name contains an underscore! If I remove the underscore (which in my case makes the filename contain only letters and a `.`), then the file is copied just fine! I have not yet checked whether other non-letter characters also produce errors.

Same here. It was working before. Getting this error on a 3 MB file.