velero: velero_backup_last_status indicates failed backup, while in reality the backup was successful
What steps did you take and what happened:
We have a Schedule named velero-daily that takes a backup every night. When we deployed a new AMI to our cluster, all our nodes were replaced with nodes running the new AMI version, which meant that the velero pod moved to a new node. After that, we noticed that the velero_backup_last_status{schedule="velero-daily"} metric indicated that the last backup had failed, while in reality the backup was successful.
What did you expect to happen:
I expected the velero_backup_last_status metric to output a 1 instead of a 0, because the backup was successful.
When querying the velero_backup_last_status metric, we get a 0, indicating that the backup failed, which triggers our alert.
Note, however, that the velero_backup_last_successful_timestamp metric shows that the backup was actually successful, so the two metrics contradict each other.
# HELP velero_backup_last_status Last status of the backup. A value of 1 is success, 0 is failure
# TYPE velero_backup_last_status gauge
velero_backup_last_status{schedule=""} 0
velero_backup_last_status{schedule="velero-daily"} 0
# HELP velero_backup_last_successful_timestamp Last time a backup ran successfully, Unix timestamp in seconds
# TYPE velero_backup_last_successful_timestamp gauge
velero_backup_last_successful_timestamp{schedule="velero-daily"} 1.694484087e+09
The CLI shows that the backup has completed with no errors/warnings.
❯ velero get backup
NAME                          STATUS      ERRORS   WARNINGS   CREATED                          EXPIRES   STORAGE LOCATION   SELECTOR
velero-daily-20230912020053   Completed   0        0          2023-09-12 04:00:53 +0200 CEST   29d       default            <none>
The Backup object also shows that the backup was taken successfully.
❯ velero get backup -o yaml velero-daily-20230912020053
apiVersion: velero.io/v1
kind: Backup
metadata:
  annotations:
    helm.sh/hook: post-install,post-upgrade,post-rollback
    helm.sh/hook-delete-policy: before-hook-creation
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"velero.io/v1","kind":"Schedule","metadata":{"annotations":{"helm.sh/hook":"post-install,post-upgrade,post-rollback","helm.sh/hook-delete-policy":"before-hook-creation"},"labels":{"app.kubernetes.io/instance":"velero","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"velero","argocd.argoproj.io/instance":"velero","helm.sh/chart":"velero-4.4.1"},"name":"velero-daily","namespace":"velero"},"spec":{"schedule":"0 2 * * *","template":{"defaultVolumesToFsBackup":false,"storageLocation":"default","ttl":"720h","volumeSnapshotLocations":["default"]}}}
    velero.io/source-cluster-k8s-gitversion: v1.27.4-eks-2d98532
    velero.io/source-cluster-k8s-major-version: "1"
    velero.io/source-cluster-k8s-minor-version: 27+
  creationTimestamp: "2023-09-12T02:00:53Z"
  generation: 24
  labels:
    app.kubernetes.io/instance: velero
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: velero
    argocd.argoproj.io/instance: velero
    helm.sh/chart: velero-4.4.1
    touch: me
    velero.io/schedule-name: velero-daily
    velero.io/storage-location: default
  managedFields:
  - apiVersion: velero.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:helm.sh/hook: {}
          f:helm.sh/hook-delete-policy: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
          f:velero.io/source-cluster-k8s-gitversion: {}
          f:velero.io/source-cluster-k8s-major-version: {}
          f:velero.io/source-cluster-k8s-minor-version: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/instance: {}
          f:app.kubernetes.io/managed-by: {}
          f:app.kubernetes.io/name: {}
          f:argocd.argoproj.io/instance: {}
          f:helm.sh/chart: {}
          f:velero.io/schedule-name: {}
          f:velero.io/storage-location: {}
      f:spec:
        .: {}
        f:csiSnapshotTimeout: {}
        f:defaultVolumesToFsBackup: {}
        f:hooks: {}
        f:itemOperationTimeout: {}
        f:metadata: {}
        f:storageLocation: {}
        f:ttl: {}
        f:volumeSnapshotLocations: {}
      f:status:
        .: {}
        f:completionTimestamp: {}
        f:expiration: {}
        f:formatVersion: {}
        f:phase: {}
        f:progress:
          .: {}
          f:itemsBackedUp: {}
          f:totalItems: {}
        f:startTimestamp: {}
        f:version: {}
        f:volumeSnapshotsAttempted: {}
        f:volumeSnapshotsCompleted: {}
    manager: velero-server
    operation: Update
    time: "2023-09-12T02:01:27Z"
  - apiVersion: velero.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:touch: {}
    manager: kubectl-edit
    operation: Update
    time: "2023-09-12T07:34:28Z"
  name: velero-daily-20230912020053
  namespace: velero
  resourceVersion: "24185233"
  uid: 4bb4508b-30df-4bc5-8f22-db4af3de0938
spec:
  csiSnapshotTimeout: 10m0s
  defaultVolumesToFsBackup: false
  hooks: {}
  itemOperationTimeout: 1h0m0s
  metadata: {}
  storageLocation: default
  ttl: 720h0m0s
  volumeSnapshotLocations:
  - default
status:
  completionTimestamp: "2023-09-12T02:01:27Z"
  expiration: "2023-10-12T02:00:53Z"
  formatVersion: 1.1.0
  phase: Completed
  progress:
    itemsBackedUp: 2103
    totalItems: 2103
  startTimestamp: "2023-09-12T02:00:53Z"
  version: 1
  volumeSnapshotsAttempted: 19
  volumeSnapshotsCompleted: 19
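As a cross-check, the completionTimestamp of this Backup object can be compared with the value exported by velero_backup_last_successful_timestamp. The Go sketch below is illustrative only; matchesCompletion is a hypothetical helper name, and the two inputs are copied from the output in this report.

```go
package main

import (
	"fmt"
	"time"
)

// matchesCompletion is a hypothetical helper (not part of Velero). It reports
// whether a metric value in Unix seconds denotes the same instant as an
// RFC 3339 completion timestamp taken from a Backup object.
func matchesCompletion(metricSeconds float64, completion string) (bool, error) {
	t, err := time.Parse(time.RFC3339, completion)
	if err != nil {
		return false, err
	}
	return t.Unix() == int64(metricSeconds), nil
}

func main() {
	// 1.694484087e+09 is the exported metric value;
	// "2023-09-12T02:01:27Z" is status.completionTimestamp of the backup.
	ok, err := matchesCompletion(1.694484087e+09, "2023-09-12T02:01:27Z")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(ok) // prints "true"
}
```

The two values denote the same second, so the timestamp metric was updated by exactly this backup even though the status metric still reads 0.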
Environment:
- Velero version (use velero version):
❯ velero version
Client:
    Version: v1.11.1
    Git commit: bdbe7eb242b0f64d5b04a7fea86d1edbb3a3587c
Server:
    Version: v1.11.1
- Velero features (use velero client config get features):
❯ velero client config get features
features: <NOT SET>
- Kubernetes version (use kubectl version):
❯ kubectl version
Client Version: v1.28.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.4-eks-2d98532
- Cloud provider or hardware configuration: EKS
About this issue
- State: closed
- Created 10 months ago
- Reactions: 7
- Comments: 15 (11 by maintainers)
However, another issue may occur: the velero_backup_last_status metric will always be 1 if the backup never begins…
The default value of the velero_backup_last_status metric is 0, which indicates failure. As long as the code exits without updating the metric, it reports failure. A better way to handle the velero_backup_last_status metric is to set the default value to 1, indicating success, and then change the value to 0 if the code exits with a failure.
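The effect of the default value can be illustrated with a small Go sketch. It uses a plain map as a stand-in for a labelled gauge; statusGauge, newStatusGauge, and recordResult are hypothetical names, and this is neither the Prometheus client API nor Velero's actual implementation.

```go
package main

import "fmt"

// statusGauge is a simplified stand-in for a labelled gauge metric.
// It is an illustration only, not Prometheus or Velero code.
type statusGauge struct {
	values map[string]float64
}

// newStatusGauge registers every schedule with the given initial value,
// mimicking a metric that is initialised when the process starts.
func newStatusGauge(schedules []string, initial float64) *statusGauge {
	g := &statusGauge{values: make(map[string]float64)}
	for _, s := range schedules {
		g.values[s] = initial
	}
	return g
}

// recordResult stores 1 for a successful backup and 0 for a failed one.
func (g *statusGauge) recordResult(schedule string, ok bool) {
	if ok {
		g.values[schedule] = 1
		return
	}
	g.values[schedule] = 0
}

func main() {
	schedules := []string{"velero-daily"}

	// Default 0: a freshly started process reports failure until a backup
	// result is recorded.
	zeroDefault := newStatusGauge(schedules, 0)
	fmt.Println("default 0, before any backup:", zeroDefault.values["velero-daily"])

	// Default 1: the process reports success until a failure is recorded.
	oneDefault := newStatusGauge(schedules, 1)
	fmt.Println("default 1, before any backup:", oneDefault.values["velero-daily"])

	oneDefault.recordResult("velero-daily", false)
	fmt.Println("default 1, after a failed backup:", oneDefault.values["velero-daily"])
}
```

Neither default is free of false signals: with 0, a process that has not yet recorded a result looks like a failed backup, and with 1, as noted above, a backup that never begins looks like a success.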