longhorn: [BUG] Data consistency lose
Describe the bug
We are using Argo WF with Longhorn as our storage. Between steps in a Workflow we observe data loss, which impacts our production clusters. In the Longhorn logs we see a lot of synchronisation errors.
time="2023-08-31T22:42:10Z" level=error msg="Dropping Longhorn replica longhorn-system/pvc-41a8caa6-6335-4065-842d-225e213d85a4-r-fb7c099d out of the queue" controller=longhorn-replica error="failed to sync replica for longhorn-system/pvc-41a8caa6-6335-4065-842d-225e213d85a4-r-fb7c099d: failed to cleanup the related replica instance before deleting replica pvc-41a8caa6-6335-4065-842d-225e213d85a4-r-fb7c099d: failed to get engine image object based on image name longhornio/longhorn-engine:v1.4.0: engineimage.longhorn.io \"ei-c5fd9691\" not found" node=xyz
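For anyone triaging similar logs, here is a rough sketch of how lines that reference an unexpected engine image tag could be picked out. It is not part of the original report: the function name `mismatched_engine_versions`, the regular expression, and the abbreviated sample line are assumptions made for illustration, and the expected tag `v1.5.1` is taken from the Environment section.

```python
import re

# Matches engine image references such as "longhornio/longhorn-engine:v1.4.0".
# Assumption: tags follow the plain vMAJOR.MINOR.PATCH form seen in the log above.
ENGINE_IMAGE_RE = re.compile(r"longhornio/longhorn-engine:(v\d+\.\d+\.\d+)")


def mismatched_engine_versions(log_lines, expected="v1.5.1"):
    """Return (version, line) pairs for every engine image tag in the
    given log lines that differs from the expected tag."""
    hits = []
    for line in log_lines:
        for version in ENGINE_IMAGE_RE.findall(line):
            if version != expected:
                hits.append((version, line))
    return hits


# Abbreviated copy of the error line from the report (hypothetical shortening).
sample = (
    'level=error controller=longhorn-replica error="failed to get engine image '
    'object based on image name longhornio/longhorn-engine:v1.4.0: '
    'engineimage.longhorn.io \\"ei-c5fd9691\\" not found" node=xyz'
)

for version, _line in mismatched_engine_versions([sample]):
    print("unexpected engine image tag:", version)
```

This only surfaces stale image references in the log text; it says nothing about why they are there or whether they relate to the data loss.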
To Reproduce
Run Argo WF with Longhorn as the storage backend on EKS, with lots of small steps.
Expected behavior
All data is synced and we do not see any data loss between Argo steps.
Environment
- Longhorn version: 1.5.1
- Installation method: Helm
- Kubernetes distro: EKS
- Number of management nodes in the cluster: 3
- Number of worker nodes in the cluster: 3
Additional context
We observe this issue intermittently: one day we see a lot of failing builds, and the next day everything works correctly.
About this issue
- State: open
- Created 10 months ago
- Reactions: 2
- Comments: 18 (9 by maintainers)
No, when we upgrade Longhorn we do not have any volumes. We treat them as temporary storage that can be destroyed whenever we want, so we remove all volumes before we start the upgrade process.
Regarding this log: we fixed it by removing the non-existent replicas (in the custom resource sense), so it's not a problem now. The problem is data inconsistency.
@derekbit we still constantly observe this issue. One thing I already mentioned: we have a lot of ERROR logs with the wrong version. We are using Longhorn version v1.5.1, and the engine image is correct and also has version v1.5.1. But the logs show: