velero: velero_backup_last_status indicates failed backup, while in reality the backup was successful

What steps did you take and what happened:

We have a Schedule named velero-daily that takes a backup every night. When we deployed a new AMI to our cluster, all our nodes were replaced, which meant that the velero pod moved to a new node. After that, we noticed that the velero_backup_last_status{schedule="velero-daily"} metric indicated that the last backup had failed, while in reality the backup was successful.

What did you expect to happen: I expected the velero_backup_last_status metric to output a 1 instead of a 0, because the backup was successful.

When querying the velero_backup_last_status metric, we get a 0, indicating that the backup failed, which triggers our alert. However, velero_backup_last_successful_timestamp shows that the backup was actually successful, so the two metrics contradict each other.

# HELP velero_backup_last_status Last status of the backup. A value of 1 is success, 0 is failure
# TYPE velero_backup_last_status gauge
velero_backup_last_status{schedule=""} 0
velero_backup_last_status{schedule="velero-daily"} 0

# HELP velero_backup_last_successful_timestamp Last time a backup ran successfully, Unix timestamp in seconds
# TYPE velero_backup_last_successful_timestamp gauge
velero_backup_last_successful_timestamp{schedule="velero-daily"} 1.694484087e+09
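
The contradiction between the two metrics can be detected mechanically. The following is a minimal sketch, not part of Velero: it parses the scrape output shown above and lists schedules whose last status is 0 even though a successful backup was recorded recently. The function names, the 24-hour window, and the simplified one-label parser are assumptions made for illustration.

```python
import re

# Scrape output copied from this report.
SAMPLE = '''\
# HELP velero_backup_last_status Last status of the backup. A value of 1 is success, 0 is failure
# TYPE velero_backup_last_status gauge
velero_backup_last_status{schedule=""} 0
velero_backup_last_status{schedule="velero-daily"} 0
# HELP velero_backup_last_successful_timestamp Last time a backup ran successfully, Unix timestamp in seconds
# TYPE velero_backup_last_successful_timestamp gauge
velero_backup_last_successful_timestamp{schedule="velero-daily"} 1.694484087e+09
'''

# Simplified pattern: handles only samples with a single `schedule` label.
LINE = re.compile(r'^(\w+)\{schedule="([^"]*)"\}\s+(\S+)$')


def parse(text):
    """Return {metric_name: {schedule: value}}, skipping comments and blanks."""
    out = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            name, schedule, value = m.groups()
            out.setdefault(name, {})[schedule] = float(value)
    return out


def contradictory_schedules(text, now, max_age_seconds=86400):
    """Schedules reporting status 0 despite a success within max_age_seconds."""
    metrics = parse(text)
    status = metrics.get("velero_backup_last_status", {})
    last_ok = metrics.get("velero_backup_last_successful_timestamp", {})
    found = []
    for schedule, value in status.items():
        ts = last_ok.get(schedule)
        if value == 0 and ts is not None and now - ts <= max_age_seconds:
            found.append(schedule)
    return sorted(found)
```

Called with a `now` one hour after the recorded timestamp, this flags velero-daily; the empty-schedule series is ignored because it has no success timestamp to compare against.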

The CLI shows that the backup has completed with no errors/warnings.

❯ velero get backup
NAME                          STATUS      ERRORS   WARNINGS   CREATED                          EXPIRES   STORAGE LOCATION   SELECTOR
velero-daily-20230912020053   Completed   0        0          2023-09-12 04:00:53 +0200 CEST   29d       default            <none>

The Backup object also shows that the backup was taken successfully.

❯ velero get backup -o yaml velero-daily-20230912020053
apiVersion: velero.io/v1
kind: Backup
metadata:
  annotations:
    helm.sh/hook: post-install,post-upgrade,post-rollback
    helm.sh/hook-delete-policy: before-hook-creation
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"velero.io/v1","kind":"Schedule","metadata":{"annotations":{"helm.sh/hook":"post-install,post-upgrade,post-rollback","helm.sh/hook-delete-policy":"before-hook-creation"},"labels":{"app.kubernetes.io/instance":"velero","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"velero","argocd.argoproj.io/instance":"velero","helm.sh/chart":"velero-4.4.1"},"name":"velero-daily","namespace":"velero"},"spec":{"schedule":"0 2 * * *","template":{"defaultVolumesToFsBackup":false,"storageLocation":"default","ttl":"720h","volumeSnapshotLocations":["default"]}}}
    velero.io/source-cluster-k8s-gitversion: v1.27.4-eks-2d98532
    velero.io/source-cluster-k8s-major-version: "1"
    velero.io/source-cluster-k8s-minor-version: 27+
  creationTimestamp: "2023-09-12T02:00:53Z"
  generation: 24
  labels:
    app.kubernetes.io/instance: velero
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: velero
    argocd.argoproj.io/instance: velero
    helm.sh/chart: velero-4.4.1
    touch: me
    velero.io/schedule-name: velero-daily
    velero.io/storage-location: default
  managedFields:
  - apiVersion: velero.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:helm.sh/hook: {}
          f:helm.sh/hook-delete-policy: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
          f:velero.io/source-cluster-k8s-gitversion: {}
          f:velero.io/source-cluster-k8s-major-version: {}
          f:velero.io/source-cluster-k8s-minor-version: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/instance: {}
          f:app.kubernetes.io/managed-by: {}
          f:app.kubernetes.io/name: {}
          f:argocd.argoproj.io/instance: {}
          f:helm.sh/chart: {}
          f:velero.io/schedule-name: {}
          f:velero.io/storage-location: {}
      f:spec:
        .: {}
        f:csiSnapshotTimeout: {}
        f:defaultVolumesToFsBackup: {}
        f:hooks: {}
        f:itemOperationTimeout: {}
        f:metadata: {}
        f:storageLocation: {}
        f:ttl: {}
        f:volumeSnapshotLocations: {}
      f:status:
        .: {}
        f:completionTimestamp: {}
        f:expiration: {}
        f:formatVersion: {}
        f:phase: {}
        f:progress:
          .: {}
          f:itemsBackedUp: {}
          f:totalItems: {}
        f:startTimestamp: {}
        f:version: {}
        f:volumeSnapshotsAttempted: {}
        f:volumeSnapshotsCompleted: {}
    manager: velero-server
    operation: Update
    time: "2023-09-12T02:01:27Z"
  - apiVersion: velero.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:touch: {}
    manager: kubectl-edit
    operation: Update
    time: "2023-09-12T07:34:28Z"
  name: velero-daily-20230912020053
  namespace: velero
  resourceVersion: "24185233"
  uid: 4bb4508b-30df-4bc5-8f22-db4af3de0938
spec:
  csiSnapshotTimeout: 10m0s
  defaultVolumesToFsBackup: false
  hooks: {}
  itemOperationTimeout: 1h0m0s
  metadata: {}
  storageLocation: default
  ttl: 720h0m0s
  volumeSnapshotLocations:
  - default
status:
  completionTimestamp: "2023-09-12T02:01:27Z"
  expiration: "2023-10-12T02:00:53Z"
  formatVersion: 1.1.0
  phase: Completed
  progress:
    itemsBackedUp: 2103
    totalItems: 2103
  startTimestamp: "2023-09-12T02:00:53Z"
  version: 1
  volumeSnapshotsAttempted: 19
  volumeSnapshotsCompleted: 19
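
For reference, the fields in the status section that support "taken successfully" can be checked as follows. This is only a sketch of how we read the object; the function name and the choice of fields are our own assumptions, not a Velero API.

```python
def backup_looks_successful(status):
    """Check the status fields of a Backup object given as a plain dict.

    Assumption: a backup counts as successful when the phase is Completed
    and the item and snapshot counters match. Missing counters compare
    equal (None == None) and are therefore not treated as a mismatch.
    """
    progress = status.get("progress", {})
    return (
        status.get("phase") == "Completed"
        and progress.get("itemsBackedUp") == progress.get("totalItems")
        and status.get("volumeSnapshotsCompleted")
        == status.get("volumeSnapshotsAttempted")
    )


# Values copied from the Backup object above.
reported_status = {
    "phase": "Completed",
    "progress": {"itemsBackedUp": 2103, "totalItems": 2103},
    "volumeSnapshotsAttempted": 19,
    "volumeSnapshotsCompleted": 19,
}
```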

Environment:

  • Velero version (use velero version):
❯ velero version
Client:
	Version: v1.11.1
	Git commit: bdbe7eb242b0f64d5b04a7fea86d1edbb3a3587c
Server:
	Version: v1.11.1
  • Velero features (use velero client config get features):
❯ velero client config get features
features: <NOT SET> 
  • Kubernetes version (use kubectl version):
❯ kubectl version
Client Version: v1.28.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.4-eks-2d98532
  • Cloud provider or hardware configuration: EKS

About this issue

  • State: closed
  • Created 10 months ago
  • Reactions: 7
  • Comments: 15 (11 by maintainers)

Most upvoted comments

But another issue may occur: the velero_backup_last_status metric will always be 1 if the backup never begins…

The default value of the velero_backup_last_status metric is 0, which indicates failure. As long as the code exits without updating the metric, it will report failure.

A better way to handle the velero_backup_last_status metric is to set the default value to 1, indicating success, and then change the value to 0 if the code exits with a failure.
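
The trade-off discussed in these comments can be illustrated with a toy model. This is a sketch under stated assumptions, not Velero's code: `Gauge` and `ToyServer` are hypothetical stand-ins, and the model assumes the gauge lives only in process memory and is re-created with its default value whenever the server pod restarts.

```python
class Gauge:
    """Toy stand-in for a gauge metric; NOT the real Prometheus client."""

    def __init__(self, default):
        self.value = default

    def set(self, value):
        self.value = value


class ToyServer:
    """Hypothetical server process exposing a last-status gauge."""

    def __init__(self, default_status):
        # A fresh process starts from the default, whatever happened before.
        self.last_status = Gauge(default_status)

    def record_backup(self, succeeded):
        self.last_status.set(1 if succeeded else 0)


def status_before_and_after_restart(default_status):
    """Run one successful backup, then simulate the pod being rescheduled."""
    before = ToyServer(default_status)
    before.record_backup(succeeded=True)
    after = ToyServer(default_status)  # new process: in-memory state is lost
    return before.last_status.value, after.last_status.value
```

With a default of 0, the model returns (1, 0): the gauge reports failure after the restart although the last backup succeeded, which matches the behavior in this report. With a default of 1 it returns (1, 1), but a server that never starts a backup also reports 1, which is the concern raised in the first comment.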