velero: Velero / Restic backup fails for PVC after a few days

Hello everyone,

we have the following problem with Velero and restic backups.

What steps did you take and what happened:

At the moment we have four OKD v3.11 clusters running in Azure. For the PVCs we are using Azure File and Azure Disk. On all four clusters we run the same backups with Velero and restic. One cluster has no problems at all, but on the other three clusters we see the following behaviour: the backups run fine for a few days, then the restic backups start to fail and we get PartiallyFailed errors for the backups of the PVCs. To fix this we have to delete all restic pods; some restic pods cannot be terminated, so we have to restart the whole node (a sketch of this workaround is shown below). After that the backups run fine again for a few days:

```
velero backup get
NAME                     STATUS                        CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
cluster-20200514180050   PartiallyFailed (52 errors)   2020-05-14 18:02:30 +0000 UTC   29d       default            <none>
cluster-20200514120050   PartiallyFailed (52 errors)   2020-05-14 12:01:39 +0000 UTC   29d       default            <none>
cluster-20200514060050   PartiallyFailed (62 errors)   2020-05-14 06:00:50 +0000 UTC   29d       default            <none>
cluster-20200513180049   PartiallyFailed (57 errors)   2020-05-13 18:01:37 +0000 UTC   28d       default            <none>
cluster-20200513120049   PartiallyFailed (51 errors)   2020-05-13 12:00:49 +0000 UTC   28d       default            <none>
cluster-20200513060049   Completed                     2020-05-13 06:00:49 +0000 UTC   28d       default            <none>
cluster-20200512180049   Completed                     2020-05-12 18:00:49 +0000 UTC   27d       default            <none>
cluster-20200512120049   Completed                     2020-05-12 12:00:49 +0000 UTC   27d       default            <none>
cluster-20200512060012   PartiallyFailed (52 errors)   2020-05-12 06:00:12 +0000 UTC   27d       default            <none>
cluster-20200511180012   PartiallyFailed (52 errors)   2020-05-11 18:01:44 +0000 UTC   26d       default            <none>
cluster-20200511120012   PartiallyFailed (52 errors)   2020-05-11 12:00:58 +0000 UTC   26d       default            <none>
cluster-20200511060012   PartiallyFailed (52 errors)   2020-05-11 06:00:12 +0000 UTC   26d       default            <none>
cluster-20200510180012   PartiallyFailed (52 errors)   2020-05-10 18:01:41 +0000 UTC   25d       default            <none>
cluster-20200510120012   PartiallyFailed (52 errors)   2020-05-10 12:00:56 +0000 UTC   25d       default            <none>
cluster-20200510060012   PartiallyFailed (52 errors)   2020-05-10 06:00:12 +0000 UTC   25d       default            <none>
cluster-20200509180011   PartiallyFailed (52 errors)   2020-05-09 18:01:40 +0000 UTC   24d       default            <none>
cluster-20200509120011   PartiallyFailed (52 errors)   2020-05-09 12:00:56 +0000 UTC   24d       default            <none>
cluster-20200509060011   PartiallyFailed (52 errors)   2020-05-09 06:00:11 +0000 UTC   24d       default            <none>
cluster-20200508180011   PartiallyFailed (52 errors)   2020-05-08 18:01:41 +0000 UTC   23d       default            <none>
cluster-20200508120011   PartiallyFailed (52 errors)   2020-05-08 12:00:57 +0000 UTC   23d       default            <none>
cluster-20200508060011   PartiallyFailed (52 errors)   2020-05-08 06:00:12 +0000 UTC   23d       default            <none>
cluster-20200507180010   PartiallyFailed (52 errors)   2020-05-07 18:01:44 +0000 UTC   22d       default            <none>
cluster-20200507120010   PartiallyFailed (52 errors)   2020-05-07 12:00:57 +0000 UTC   22d       default            <none>
cluster-20200507060010   PartiallyFailed (52 errors)   2020-05-07 06:00:10 +0000 UTC   22d       default            <none>
cluster-20200506180010   PartiallyFailed (51 errors)   2020-05-06 18:00:10 +0000 UTC   21d       default            <none>
cluster-20200506120010   Completed                     2020-05-06 12:00:10 +0000 UTC   21d       default            <none>
cluster-20200506060010   Completed                     2020-05-06 06:00:10 +0000 UTC   21d       default            <none>
cluster-20200505180010   Completed                     2020-05-05 18:00:10 +0000 UTC   20d       default            <none>
cluster-20200505120009   Completed                     2020-05-05 12:00:09 +0000 UTC   20d       default            <none>
cluster-20200505060009   Completed                     2020-05-05 06:00:09 +0000 UTC   20d       default            <none>
cluster-20200504180009   Completed                     2020-05-04 18:00:09 +0000 UTC   19d       default            <none>
cluster-20200504120009   Completed                     2020-05-04 12:00:09 +0000 UTC   19d       default            <none>
```
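The workaround we currently apply looks roughly like this (a sketch only; the `name=restic` pod label is an assumption based on the default Velero restic DaemonSet, and the pod/node names are placeholders):

```bash
# Sketch of the manual workaround described above.
# Assumes the default Velero restic DaemonSet label name=restic and our
# openshift-backup namespace; pod and node names are placeholders.

# 1. Delete all restic pods so the DaemonSet recreates them:
oc -n openshift-backup delete pod -l name=restic

# 2. If a pod hangs in Terminating, force-remove it:
oc -n openshift-backup delete pod <stuck-restic-pod> --grace-period=0 --force

# 3. If the pod still does not go away, drain and reboot the affected node:
oc adm drain <node> --ignore-daemonsets --delete-local-data
# reboot the node, then:
oc adm uncordon <node>
```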

What did you expect to happen: That the backups run without any problems and that we don't hit this issue every few days.

The output of the following commands will help us better understand what’s going on:

  • kubectl logs deployment/velero -n velero — we only get the following log entries in the velero pod:

```
time="2020-05-15T04:02:00Z" level=debug msg="Checking repository for stale locks" controller=restic-repository logSource="pkg/controller/restic_repository_controller.go:140" name=div7-**** namespace=openshift-backup
time="2020-05-15T04:02:01Z" level=debug msg="Ran restic command" command="restic unlock --repo=azure:okd-c1-1eu1-b-p:/restic/div7-**** --password-file=/tmp/velero-restic-credentials-div7-**** --cache-dir=/scratch/.cache/restic" logSource="pkg/restic/repository_manager.go:275" repository=div7-**** stderr= stdout="successfully removed locks\n"
time="2020-05-15T04:02:01Z" level=debug msg=resticRepositoryController.runMaintenanceIfDue controller=restic-repository logSource="pkg/controller/restic_repository_controller.go:217" name=div7-**** namespace=openshift-backup
time="2020-05-15T04:02:01Z" level=debug msg="not due for maintenance" controller=restic-repository logSource="pkg/controller/restic_repository_controller.go:222" name=div7-**** namespace=openshift-backup
time="2020-05-15T04:02:01Z" level=debug msg="Checking for existing backup storage locations to sync into cluster" controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:124"
time="2020-05-15T04:02:01Z" level=debug msg="Checking if backups need to be synced at this time for this location" backupLocation=default controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:156"
time="2020-05-15T04:02:31Z" level=debug msg="Checking for existing backup storage locations to sync into cluster" controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:124"
time="2020-05-15T04:02:31Z" level=debug msg="Checking if backups need to be synced at this time for this location" backupLocation=default controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:156"
time="2020-05-15T04:02:31Z" level=debug msg="Checking backup location for backups to sync into cluster" backupLocation=default controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:163"
time="2020-05-15T04:02:31Z" level=debug msg="looking for plugin in registry" controller=backup-sync kind=ObjectStore logSource="pkg/plugin/clientmgmt/manager.go:99" name=velero.io/azure
time="2020-05-15T04:02:31Z" level=debug msg="creating new restartable plugin process" command=/plugins/velero-plugin-for-microsoft-azure controller=backup-sync kind=ObjectStore logSource="pkg/plugin/clientmgmt/manager.go:114" name=velero.io/azure
time="2020-05-15T04:02:31Z" level=debug msg="starting plugin" args="[/plugins/velero-plugin-for-microsoft-azure --log-level debug]" cmd=/plugins/velero-plugin-for-microsoft-azure controller=backup-sync logSource="pkg/plugin/clientmgmt/logrus_adapter.go:74" path=/plugins/velero-plugin-for-microsoft-azure
time="2020-05-15T04:02:31Z" level=debug msg="plugin started" cmd=/plugins/velero-plugin-for-microsoft-azure controller=backup-sync logSource="pkg/plugin/clientmgmt/logrus_adapter.go:74" path=/plugins/velero-plugin-for-microsoft-azure pid=13825
time="2020-05-15T04:02:31Z" level=debug msg="waiting for RPC address" cmd=/plugins/velero-plugin-for-microsoft-azure controller=backup-sync logSource="pkg/plugin/clientmgmt/logrus_adapter.go:74" path=/plugins/velero-plugin-for-microsoft-azure
time="2020-05-15T04:02:31Z" level=debug msg="using plugin" cmd=/plugins/velero-plugin-for-microsoft-azure controller=backup-sync logSource="pkg/plugin/clientmgmt/logrus_adapter.go:74" version=2
time="2020-05-15T04:02:31Z" level=debug msg="plugin address" address=/tmp/plugin136593745 cmd=/plugins/velero-plugin-for-microsoft-azure controller=backup-sync logSource="pkg/plugin/clientmgmt/logrus_adapter.go:74" network=unix pluginName=velero-plugin-for-microsoft-azure
time="2020-05-15T04:02:31Z" level=debug msg="Got backups from backup store" backupCount=62 backupLocation=default controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:178"
time="2020-05-15T04:02:31Z" level=debug msg="Got backups from cluster" backupCount=62 backupLocation=default controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:185"
time="2020-05-15T04:02:31Z" level=debug msg="No backups found in the backup location that need to be synced into the cluster" backupLocation=default controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:198"
time="2020-05-15T04:02:31Z" level=debug msg="plugin process exited" cmd=/plugins/velero-plugin-for-microsoft-azure controller=backup-sync logSource="pkg/plugin/clientmgmt/logrus_adapter.go:74" path=/plugins/velero-plugin-for-microsoft-azure pid=13825
time="2020-05-15T04:02:31Z" level=debug msg="plugin exited" cmd=/plugins/velero-plugin-for-microsoft-azure controller=backup-sync logSource="pkg/plugin/clientmgmt/logrus_adapter.go:74"
time="2020-05-15T04:02:50Z" level=debug msg="Running processSchedule" controller=schedule key=openshift-backup/cluster logSource="pkg/controller/schedule_controller.go:139"
time="2020-05-15T04:02:50Z" level=debug msg="Getting Schedule" controller=schedule key=openshift-backup/cluster logSource="pkg/controller/schedule_controller.go:145"
time="2020-05-15T04:02:50Z" level=debug msg="Cloning schedule" controller=schedule key=openshift-backup/cluster logSource="pkg/controller/schedule_controller.go:163"
time="2020-05-15T04:02:50Z" level=debug msg="Schedule is not due, skipping" controller=schedule logSource="pkg/controller/schedule_controller.go:250" nextRunTime="2020-05-15 06:00:00 +0000 UTC" schedule=openshift-backup/cluster
time="2020-05-15T04:14:01Z" level=debug msg="Backup has not expired yet, skipping" backup=openshift-backup/cluster-20200424120035 controller=gc-controller expiration="2020-05-24 12:00:35 +0000 UTC" logSource="pkg/controller/gc_controller.go:133"
```

  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml:

```
Name:         cluster-20200514180050
Namespace:    openshift-backup
Labels:       velero.io/schedule-name=cluster
              velero.io/storage-location=default
Annotations:  <none>

Phase:  PartiallyFailed (run velero backup logs cluster-20200514180050 for more information)

Errors:    52
Warnings:  0

Namespaces:
  Included:  *
  Excluded:  openshift-logging, openshift-monitoring, openshift-backup, cocdit-opendistro, lens-metrics, monitoring

Resources:
  Included:        *
  Excluded:        storageclasses.storage.k8s.io, groups.user.openshift.io, identities.user.openshift.io, useridentitymappings.user.openshift.io, users.user.openshift.io, images.image.openshift.io
  Cluster-scoped:  included

Label selector:  <none>

Storage Location:  default

Snapshot PVs:  false

TTL:  720h0m0s

Hooks:  <none>

Backup Format Version:  1

Started:    2020-05-14 18:02:30 +0000 UTC
Completed:  2020-05-15 00:03:22 +0000 UTC

Expiration:  2020-06-13 18:02:30 +0000 UTC

Persistent Volumes:  <none included>

Restic Backups (specify --details for more information):
  Completed:  14
  New:        38
```

  • velero backup logs <backupname> (the relevant error is at the end; a sketch for inspecting the stuck PodVolumeBackups follows this list):

```
time="2020-05-15T00:02:50Z" level=info msg="Executing podAction" backup=openshift-backup/cluster-20200514180050 cmd=/velero logSource="pkg/backup/pod_action.go:51" pluginName=velero
time="2020-05-15T00:02:50Z" level=info msg="Done executing podAction" backup=openshift-backup/cluster-20200514180050 cmd=/velero logSource="pkg/backup/pod_action.go:77" pluginName=velero
time="2020-05-15T00:02:50Z" level=debug msg="Skipping action because it does not apply to this resource" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:311" name=livewriter-12-ljwsz namespace=div7-**** resource=pods
time="2020-05-15T00:02:50Z" level=debug msg="Skipping action because it does not apply to this resource" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:311" name=livewriter-12-ljwsz namespace=div7-**** resource=pods
time="2020-05-15T00:02:50Z" level=debug msg="Executing post hooks" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:247" name=livewriter-12-ljwsz namespace=div7-**** resource=pods
time="2020-05-15T00:02:50Z" level=info msg="Backing up item" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:169" name=reverse-proxy-dashboard-9-fwtgb namespace=div7-**** resource=pods
time="2020-05-15T00:02:50Z" level=debug msg="Executing pre hooks" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:171" name=reverse-proxy-dashboard-9-fwtgb namespace=div7-**** resource=pods
time="2020-05-15T00:02:50Z" level=debug msg="Skipping action because it does not apply to this resource" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:311" name=reverse-proxy-dashboard-9-fwtgb namespace=div7-**** resource=pods
time="2020-05-15T00:02:50Z" level=info msg="Executing custom action" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:330" name=reverse-proxy-dashboard-9-fwtgb namespace=div7-**** resource=pods
time="2020-05-15T00:02:50Z" level=info msg="Executing podAction" backup=openshift-backup/cluster-20200514180050 cmd=/velero logSource="pkg/backup/pod_action.go:51" pluginName=velero
time="2020-05-15T00:02:50Z" level=info msg="Done executing podAction" backup=openshift-backup/cluster-20200514180050 cmd=/velero logSource="pkg/backup/pod_action.go:77" pluginName=velero
time="2020-05-15T00:02:50Z" level=debug msg="Skipping action because it does not apply to this resource" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:311" name=reverse-proxy-dashboard-9-fwtgb namespace=div7-**** resource=pods
time="2020-05-15T00:02:50Z" level=debug msg="Skipping action because it does not apply to this resource" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:311" name=reverse-proxy-dashboard-9-fwtgb namespace=div7-**** resource=pods
time="2020-05-15T00:02:50Z" level=debug msg="Executing post hooks" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:247" name=reverse-proxy-dashboard-9-fwtgb namespace=div7-**** resource=pods
time="2020-05-15T00:02:50Z" level=info msg="Backing up item" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:169" name=saml-service-provider-9-5wrc4 namespace=div7-**** resource=pods
time="2020-05-15T00:02:50Z" level=debug msg="Executing pre hooks" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:171" name=saml-service-provider-9-5wrc4 namespace=div7-**** resource=pods
time="2020-05-15T00:02:50Z" level=debug msg="Skipping action because it does not apply to this resource" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:311" name=saml-service-provider-9-5wrc4 namespace=div7-**** resource=pods
time="2020-05-15T00:02:50Z" level=info msg="Executing custom action" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:330" name=saml-service-provider-9-5wrc4 namespace=div7-**** resource=pods
time="2020-05-15T00:02:50Z" level=info msg="Executing podAction" backup=openshift-backup/cluster-20200514180050 cmd=/velero logSource="pkg/backup/pod_action.go:51" pluginName=velero
time="2020-05-15T00:02:50Z" level=info msg="Adding pvc session-pvc to additionalItems" backup=openshift-backup/cluster-20200514180050 cmd=/velero logSource="pkg/backup/pod_action.go:67" pluginName=velero
time="2020-05-15T00:02:50Z" level=info msg="Done executing podAction" backup=openshift-backup/cluster-20200514180050 cmd=/velero logSource="pkg/backup/pod_action.go:77" pluginName=velero
time="2020-05-15T00:02:50Z" level=info msg="Backing up item" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:169" name=session-pvc namespace=div7-**** resource=persistentvolumeclaims
time="2020-05-15T00:02:50Z" level=debug msg="Executing pre hooks" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:171" name=session-pvc namespace=div7-**** resource=persistentvolumeclaims
time="2020-05-15T00:02:50Z" level=debug msg="Skipping action because it does not apply to this resource" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:311" name=session-pvc namespace=div7-**** resource=persistentvolumeclaims
time="2020-05-15T00:02:50Z" level=debug msg="Skipping action because it does not apply to this resource" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:311" name=session-pvc namespace=div7-**** resource=persistentvolumeclaims
time="2020-05-15T00:02:50Z" level=info msg="Executing custom action" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:330" name=session-pvc namespace=div7-**** resource=persistentvolumeclaims
time="2020-05-15T00:02:50Z" level=info msg="Executing PVCAction" backup=openshift-backup/cluster-20200514180050 cmd=/velero logSource="pkg/backup/backup_pv_action.go:49" pluginName=velero
time="2020-05-15T00:02:50Z" level=info msg="Backing up item" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:169" name=pvc-69581993-6f34-11ea-8d0a-000d3aaaafab namespace= resource=persistentvolumes
time="2020-05-15T00:02:50Z" level=debug msg="Executing pre hooks" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:171" name=pvc-69581993-6f34-11ea-8d0a-000d3aaaafab namespace= resource=persistentvolumes
time="2020-05-15T00:02:50Z" level=debug msg="Skipping action because it does not apply to this resource" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:311" name=pvc-69581993-6f34-11ea-8d0a-000d3aaaafab namespace= resource=persistentvolumes
time="2020-05-15T00:02:50Z" level=debug msg="Skipping action because it does not apply to this resource" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:311" name=pvc-69581993-6f34-11ea-8d0a-000d3aaaafab namespace= resource=persistentvolumes
time="2020-05-15T00:02:50Z" level=debug msg="Skipping action because it does not apply to this resource" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:311" name=pvc-69581993-6f34-11ea-8d0a-000d3aaaafab namespace= resource=persistentvolumes
time="2020-05-15T00:02:50Z" level=debug msg="Skipping action because it does not apply to this resource" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:311" name=pvc-69581993-6f34-11ea-8d0a-000d3aaaafab namespace= resource=persistentvolumes
time="2020-05-15T00:02:50Z" level=info msg="Executing takePVSnapshot" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:400" name=pvc-69581993-6f34-11ea-8d0a-000d3aaaafab namespace= resource=persistentvolumes
time="2020-05-15T00:02:50Z" level=info msg="Backup has volume snapshots disabled; skipping volume snapshot action." backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:403" name=pvc-69581993-6f34-11ea-8d0a-000d3aaaafab namespace= resource=persistentvolumes
time="2020-05-15T00:02:50Z" level=debug msg="Executing post hooks" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:247" name=pvc-69581993-6f34-11ea-8d0a-000d3aaaafab namespace= resource=persistentvolumes
time="2020-05-15T00:02:50Z" level=debug msg="Skipping action because it does not apply to this resource" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:311" name=session-pvc namespace=div7-**** resource=persistentvolumeclaims
time="2020-05-15T00:02:50Z" level=debug msg="Executing post hooks" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:247" name=session-pvc namespace=div7-**** resource=persistentvolumeclaims
time="2020-05-15T00:02:50Z" level=debug msg="Skipping action because it does not apply to this resource" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:311" name=saml-service-provider-9-5wrc4 namespace=div7-**** resource=pods
time="2020-05-15T00:02:50Z" level=debug msg="Skipping action because it does not apply to this resource" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:311" name=saml-service-provider-9-5wrc4 namespace=div7-**** resource=pods
time="2020-05-15T00:02:50Z" level=debug msg="Executing post hooks" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/item_backupper.go:247" name=saml-service-provider-9-5wrc4 namespace=div7-**** resource=pods
time="2020-05-15T00:02:50Z" level=info msg="1 errors encountered backup up item" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/resource_backupper.go:283" name=saml-service-provider-9-5wrc4 namespace= resource=pods
time="2020-05-15T00:02:50Z" level=error msg="Error backing up item" backup=openshift-backup/cluster-20200514180050 error="timed out waiting for all PodVolumeBackups to complete" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:172" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" group=v1 logSource="pkg/backup/resource_backupper.go:287" name=saml-service-provider-9-5wrc4 namespace= resource=pods
```
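Since the backup fails with "timed out waiting for all PodVolumeBackups to complete" and the describe output above shows 38 restic backups stuck in "New", this is roughly how we check which PodVolumeBackups never get picked up and which node they were assigned to (a sketch only; the label and field names are assumptions based on the Velero 1.3 CRDs):

```bash
# Sketch: list the PodVolumeBackups belonging to the failed backup and show
# their phase and the node whose restic pod is responsible for them.
# The velero.io/backup-name label and the .spec.node / .status.phase fields
# are assumptions about how Velero 1.3 labels and structures these CRs.
oc get podvolumebackups -n openshift-backup \
  -l velero.io/backup-name=cluster-20200514180050 \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,NODE:.spec.node,VOLUME:.spec.volume
```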

Anything else you would like to add:

  • For each PVC we get the following error message. We first increased the restic timeout to 6 hours and then to 12 hours, but this did not solve the issue (how we raised the timeout is sketched after this list):

```
time="2020-05-15T00:02:55Z" level=info msg="1 errors encountered backup up item" backup=openshift-backup/cluster-20200514180050 group=v1 logSource="pkg/backup/resource_backupper.go:283" name=saml-service-provider-22-sgnh6 namespace= resource=pods
time="2020-05-15T00:02:55Z" level=error msg="Error backing up item" backup=openshift-backup/cluster-20200514180050 error="timed out waiting for all PodVolumeBackups to complete" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:172" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" group=v1 logSource="pkg/backup/resource_backupper.go:287" name=saml-service-provider-22-sgnh6 namespace= resource=pods
```

  • velero restic repo get:

```
NAME        STATUS   LAST MAINTENANCE
div1-c4*    Ready    2020-05-11 12:08:17 +0000 UTC
div1-cb*    Ready    2020-05-08 12:08:48 +0000 UTC
div7-b*wb   Ready    2020-05-08 12:03:19 +0000 UTC
div7-b*bd   Ready    2020-05-08 12:08:52 +0000 UTC
div7-b*kb   Ready    2020-05-14 00:02:03 +0000 UTC
div7-b*jr   Ready    2020-05-08 12:08:10 +0000 UTC
div7-b*gv   Ready    2020-05-08 12:10:57 +0000 UTC
div7-b*h7   Ready    2020-05-08 12:08:22 +0000 UTC
div7-b*g2   Ready    2020-05-08 12:08:39 +0000 UTC
div7-b*rh   Ready    2020-05-08 12:11:12 +0000 UTC
div7-b*kv   Ready    2020-05-08 12:09:55 +0000 UTC
div7-b*5r   Ready    2020-05-12 13:06:58 +0000 UTC
div7-b*mz   Ready    2020-05-08 12:10:27 +0000 UTC
seed-g2*    Ready    2020-05-08 12:10:47 +0000 UTC
```

  • oc describe sc block:

```
Name:                  block
IsDefaultClass:        No
Provisioner:           kubernetes.io/azure-disk
Parameters:            kind=Managed,storageaccounttype=Standard_LRS
AllowVolumeExpansion:  <unset>
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     Immediate
Events:                <none>
```

  • oc describe sc file:

```
Name:                  file
IsDefaultClass:        Yes
Provisioner:           kubernetes.io/azure-file
Parameters:            storageAccount=azurefileb96695
AllowVolumeExpansion:  <unset>
MountOptions:          nouser_xattr dir_mode=0777 file_mode=0777 uid=1000 gid=2000
ReclaimPolicy:         Delete
VolumeBindingMode:     Immediate
Events:                <none>
```
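For reference, this is roughly how we raised the restic timeout mentioned above (a sketch only; it assumes the Velero 1.3 server flag --restic-timeout and our openshift-backup namespace):

```bash
# Sketch: raise the time Velero waits for PodVolumeBackups to complete.
# Assumes the Velero 1.3 server flag --restic-timeout; adjust the namespace
# and args to your deployment.
oc -n openshift-backup edit deployment/velero
# then add the flag to the server container args, e.g.:
#   args:
#     - server
#     - --restic-timeout=12h
```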

Environment:

  • Velero version (use velero version): Client: v1.3.2 (git commit 55a9914a3e4719fb1578529c45430a8c11c28145), Server: v1.3.2
  • Velero features (use velero client config get features): <NOT SET>
  • Kubernetes version (use kubectl version): oc v3.11.0+39132cb-398, kubernetes v1.11.0+d4cacc0
  • Cloud provider or hardware configuration: Azure
  • OS (e.g. from /etc/os-release): NAME="CentOS Linux" VERSION="7 (Core)"

Thanks for your help! Regards Tobias

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 29 (12 by maintainers)

Most upvoted comments

@ashish-amarnath Just for your information: I'm on vacation from 01.08. to 17.08.20. If you need any log files or information for this issue, my colleagues @Mr-iX and @philipp1992 can help you. Regards Tobias

Hello @ashish-amarnath,

Sorry for my late reply but yesterday I had a lot to do.

I've collected all the logs and then deleted all restic pods to find out which pod hangs and cannot be terminated:

```
restic-m576l   1/1   Terminating   0   6d   10.128.2.89   node-6   <none>
```

```
oc get pods -o wide
NAME                      READY   STATUS        RESTARTS   AGE   IP             NODE       NOMINATED NODE
restic-2blrj              1/1     Running       0          32s   10.130.3.188   infra-1    <none>
restic-2mzg5              1/1     Running       0          43s   10.131.4.141   node-3     <none>
restic-5h5d6              1/1     Running       0          42s   10.128.0.12    master-0   <none>
restic-7xgzg              1/1     Running       0          44s   10.130.5.227   infra-2    <none>
restic-8jwh7              1/1     Running       0          32s   10.128.4.246   node-1     <none>
restic-bn8ps              1/1     Running       0          35s   10.129.7.11    node-5     <none>
restic-fxqbp              1/1     Running       0          37s   10.129.5.138   infra-0    <none>
restic-ggwnc              1/1     Running       0          31s   10.129.1.47    master-1   <none>
restic-m576l              1/1     Terminating   0          6d    10.128.2.89    node-6     <none>
restic-m95w5              1/1     Running       0          38s   10.131.0.201   node-7     <none>
restic-q5bhp              1/1     Running       0          35s   10.128.7.104   node-2     <none>
restic-rbtww              1/1     Running       0          39s   10.129.2.69    node-0     <none>
restic-sxq4j              1/1     Running       0          42s   10.130.0.110   master-2   <none>
restic-zrvjz              1/1     Running       0          38s   10.131.3.97    node-4     <none>
velero-6b6cbdb75f-svhvf   1/1     Running       0          44s   10.131.0.200   node-7     <none>
```

I've uploaded the logs of velero-6b6cbdb75f-svhvf and restic-m576l to a gist. I've also collected the origin-node.service logs from node-6 and grepped for azure.
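The commands on node-6 were roughly the following (a sketch; the origin-node.service unit name comes from OKD 3.11, and the filters are just what I grepped for):

```bash
# Sketch: collect the kubelet (origin-node) logs on node-6 and filter for
# Azure volume operations; also check for Azure File (CIFS) mounts that might
# be stuck and keep the restic pod from terminating.
journalctl -u origin-node.service --since yesterday | grep -i azure > node-6-origin-node-azure.log
mount | grep -i cifs
```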

In the master-controller log I also found an entry related to azure-file, but I already fixed that issue this morning. https://gist.github.com/burkhat/015271e29bf569024a1e00ba212d8e2e

The issue can occur on any worker node (node-*); I can't see that it happens on some nodes more often than on others.

Regards Tobias

@burkhat Thank you for your patience. I’ve added this to my list. Will take a look at it today.

@skriss At the moment it is still working fine; I will update you at the end of this week.

@skriss I've now cleared all logs from the pods and I'm waiting for the error to occur again. I will send you the logs as soon as it reappears.

Hello @carlisia,

we had the following limit settings for velero and restic:

```yaml
velero:
  resources:
    limits:
      cpu: "2"
      memory: 2Gi
    requests:
      cpu: "1"
      memory: 512Mi
```

```yaml
restic:
  resources:
    limits:
      cpu: "2"
      memory: 1Gi
    requests:
      cpu: 500m
      memory: 512Mi
```

I've now increased the limits as follows:

```yaml
velero:
  resources:
    limits:
      cpu: "4"
      memory: 4Gi
    requests:
      cpu: "1"
      memory: 512Mi
```

```yaml
restic:
  resources:
    limits:
      cpu: "4"
      memory: 4Gi
    requests:
      cpu: 500m
      memory: 512Mi
```
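I applied the new limits roughly like this (a sketch; it assumes our velero deployment and restic DaemonSet in the openshift-backup namespace):

```bash
# Sketch: apply the increased resource limits shown above to the Velero
# deployment and the restic DaemonSet (names/namespace as in our install).
oc -n openshift-backup set resources deployment/velero \
  --limits=cpu=4,memory=4Gi --requests=cpu=1,memory=512Mi
oc -n openshift-backup set resources daemonset/restic \
  --limits=cpu=4,memory=4Gi --requests=cpu=500m,memory=512Mi
```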

I will update you next week if we still have the problems.

Thanks & Regards Tobias