velero: Velero 1.9.1 backup timed out waiting for all PodVolumeBackups to complete, leaving a process consuming 99% CPU that has already been running for >5 hours

$ velero version
Client:
	Version: v1.9.1
	Git commit: e4c84b7b3d603ba646364d5571c69a6443719bf2
Server:
	Version: v1.9.1
  • Velero restic backup timed out:
time="2022-08-26T13:05:40Z" level=error msg="Error backing up item" backup=velero/velero-default-20220826090539 error="timed out waiting for all PodVolumeBackups to complete

Leaving a process consuming 99% CPU behind:

# ps -ef|grep 'restic backup'
root     2550974 2549993 99 09:05 ?        05:47:31 restic backup ...

It has been running for more than 5 hours now.

That is the same restic PID that has kept running after I lifted the memory limit from 1Gi to 2Gi; see https://github.com/vmware-tanzu/velero/issues/2073#issuecomment-1228287647
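
For reference, the limit change was applied roughly as follows (a sketch only; it assumes the restic daemonset name and the velero namespace used by a default install):

kubectl -n velero patch daemonset restic --type=json -p='[
  {"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"2Gi"}
]'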

Please advise.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 19 (9 by maintainers)

Most upvoted comments

repository is already locked exclusively by PID 1673956

The backup command doesn’t use exclusive locks. It must be a different command that left that lock behind.
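
If a stale lock does need to be cleared manually, restic itself can list and remove locks. A minimal sketch, assuming you export the repository URL and password that Velero configured for this repository (the values below are placeholders):

export RESTIC_REPOSITORY=s3:s3.amazonaws.com/<bucket>/restic/<namespace>
export RESTIC_PASSWORD=<repository password>
restic list locks     # show the IDs of locks currently held on the repository
restic unlock         # remove only the locks restic considers stale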

  • Even though the data doesn’t occupy disk space, Restic has no sparse-file optimization: the zeroes are treated as normal data by its deduplication system, so every chunk of this data still ends up with an index entry.

If all chunks were different, 4TB shouldn’t require much more than 400MB for the index. However, as the file appears to consist largely of zeroes, deduplication will be nearly perfect, so the index will probably only be a few MB. What requires a lot more memory is the directory metadata: with lots of zeros in it, the file will be cut into approximately 8 million chunks. The end result is about 500MB of directory metadata, which temporarily requires at least 3 times as much memory (restic < 0.14.0 might require a bit more). Thus 2GB is definitely not enough; 10GB might work, but that depends on the total number of files in the backed-up folder.
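
A quick back-of-the-envelope check of those figures (the 512 KiB average chunk size and the ~64 bytes of tree metadata per chunk reference are assumptions, chosen to match the numbers above):

# 4 TiB expressed in KiB, divided by an assumed 512 KiB average chunk size:
echo $(( 4 * 1024 * 1024 * 1024 / 512 ))   # 8388608, i.e. roughly 8 million chunks
# assumed ~64 bytes of directory/tree metadata per chunk reference:
echo $(( 8388608 * 64 / 1024 / 1024 ))     # 512, i.e. about 500 MiB of metadata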

Just processing the 4TB of zeros will take several hours: restic apparently uses only a single CPU core for this, and I’d expect it to manage something between 100-200MB/s.
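
As a rough estimate (assuming ~150 MB/s single-core throughput over 4 TB, i.e. 4,000,000 MB):

echo $(( 4 * 1000 * 1000 / 150 / 3600 ))   # 7, i.e. about 7 hours of pure chunking and hashing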

Thanks a lot for providing so many test results; this shows that the tar tool works well for sparse files.
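
For anyone who wants to reproduce that result, a minimal sparse-file test along these lines should do (file name and size are arbitrary; GNU tar and coreutils assumed):

# create a 4 GiB file that is entirely a hole, so it allocates almost no disk blocks
truncate -s 4G sparse.img
du -h --apparent-size sparse.img   # reports 4.0G (logical size)
du -h sparse.img                   # reports ~0 (actually allocated blocks)
# GNU tar with --sparse records only the data regions and skips the holes
tar --sparse -cf sparse.tar sparse.img
ls -lh sparse.tar                  # the archive stays in the KB range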

I agree the tar tool could solve the current problem perfectly, but that is true for this particular issue only. As a universal Kubernetes backup tool, Velero cannot take in changes that work only for one specific case; instead, we pursue universal, well-engineered solutions, especially when we already know the right directions (the two solutions I mentioned) for solving the current problem.

As a general Kubernetes backup tool, Velero always pursues general solutions instead of building very limited workarounds. Therefore, we need to consider the following:

  • Which files need tar is an open question: the entire volume, or only some files or directories in it? We need a way for users to tell Velero this.
  • The same is true for the --sparse option, which users would also need to signal to Velero.
  • Velero would need to change its general backup and restore workflow to integrate the new tar command.
  • By calling tar --sparse, we put everything on the tar tool: does it behave the way we expect? How much CPU/memory does it take? Will that change? We don’t know; ultimately, tar --sparse was not designed for this purpose.

Therefore, there is quite a lot of work to turn your proposal into a general solution in Velero, so it is not a quick fix; on the other hand, even if we had it, we might eventually find that the solution itself is still not general enough. We therefore need a well-rounded evaluation of this approach first.

On the other hand, Velero does aim to solve this kind of problem in the future, but in another direction: block-level data movement. Please stay tuned to Velero’s roadmap and release plan for this. Since there is no quick fix here, we prefer to put all our attention on that existing plan in our roadmap.

I am closing this issue since we now know the cause of the CPU and memory usage and a Restic issue has been opened. The lock problem will be handled separately in #5268.

Feel free to reopen it whenever needed.

unable to create lock in backend

This means some Restic process has left a file lock in the repository. That may be expected here, since the current backup hasn’t finished within 1 day; when a new Restic process is launched the next day, it is blocked by that lock.

The PodVolume timeout defaults to 240 minutes. If your backup data is large, you can raise the PodVolume timeout in two ways (see the sketch after this list):

  • Annotate the backup with “velero.io/pod-volume-timeout”. This takes effect only on the backup you annotate.
  • Set the --restic-timeout <duration> server parameter on the Velero deployment. This affects all backups.
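
A minimal sketch of both options (the backup name is a placeholder, the velero namespace is assumed, and the 6h duration should be adjusted to your data size):

# per backup: annotate the Backup resource
kubectl -n velero annotate backup <backup-name> velero.io/pod-volume-timeout=6h
# for all backups: append the server flag to the Velero deployment args
kubectl -n velero patch deployment velero --type=json -p='[
  {"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--restic-timeout=6h"}
]'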

If you don’t want to wait for the running Restic backup process, you may need to restart the Restic DaemonSet.
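
For example (assuming the default daemonset name restic in the velero namespace), a rolling restart terminates the in-progress restic backup processes:

kubectl -n velero rollout restart daemonset/restic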

The Restic process can consume a lot of memory; in some scenarios Restic will use more than 10G. This is one of the reasons the Velero team is working on the integration with Kopia. It usually happens when backing up a lot of small files with Restic. For now, please back up only the data and volumes you need, rather than all volumes in the cluster, to avoid this situation.
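
One way to keep the scope small is the opt-in pod annotation, which lists only the volumes that actually need file-level backup (the namespace, pod, and volume names below are placeholders):

kubectl -n <app-namespace> annotate pod <pod-name> backup.velero.io/backup-volumes=<volume-name>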