longhorn: [BUG] Data consistency loss

Describe the bug (πŸ› if you encounter this issue)

We are using Argo WF with Longhorn as our storage. Between steps in a Workflow we observe data loss, which impacts our production clusters. In the Longhorn logs we observe many synchronisation errors.

time="2023-08-31T22:42:10Z" level=error msg="Dropping Longhorn replica longhorn-system/pvc-41a8caa6-6335-4065-842d-225e213d85a4-r-fb7c099d out of the queue" controller=longhorn-replica error="failed to sync replica for longhorn-system/pvc-41a8caa6-6335-4065-842d-225e213d85a4-r-fb7c099d: failed to cleanup the related replica instance before deleting replica pvc-41a8caa6-6335-4065-842d-225e213d85a4-r-fb7c099d: failed to get engine image object based on image name [longhornio/longhorn-engine:v1.4.0:](longhornio/longhorn-engine:v1.4.0:) [engineimage.longhorn.io](http://engineimage.longhorn.io/) \"ei-c5fd9691\" not found" node=xyz

To Reproduce

Run Argo WF with Longhorn as the storage backend on EKS, with lots of small steps; a minimal sketch follows.
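For context, a minimal sketch of the kind of workflow that triggers this for us (a sketch only: the workflow name, image, and sizes are illustrative, and it assumes Argo Workflows is installed and a StorageClass named longhorn exists):

# Submit a workflow whose two small steps share one Longhorn-backed PVC;
# the second step expects to read what the first step wrote.
kubectl create -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: longhorn-repro-
spec:
  entrypoint: main
  volumeClaimTemplates:
    - metadata:
        name: workdir
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: longhorn
        resources:
          requests:
            storage: 1Gi
  templates:
    - name: main
      steps:
        - - name: write
            template: writer
        - - name: read
            template: reader
    - name: writer
      container:
        image: alpine:3.18
        command: [sh, -c]
        args: ["echo hello > /work/data && sync"]
        volumeMounts:
          - name: workdir
            mountPath: /work
    - name: reader
      container:
        image: alpine:3.18
        command: [sh, -c]
        args: ["cat /work/data"]   # fails if the previous step's write was lost
        volumeMounts:
          - name: workdir
            mountPath: /work
EOF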

Expected behavior

All data is synced and we do not see any data loss between Argo steps.

Environment

  • Longhorn version: 1.5.1
  • Installation method: Helm
  • Kubernetes distro: EKS
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster: 3

Additional context

We observe this issue periodically: one day we see a lot of failing builds, the next day everything works correctly.

About this issue

  • Original URL
  • State: open
  • Created 10 months ago
  • Reactions: 2
  • Comments: 18 (9 by maintainers)

Most upvoted comments

No, when we upgrade Longhorn we do not have any volumes. We treat them as temporary storage that can be destroyed whenever we want, so we remove all volumes before we start the upgrade process.

@derekbit we still constantly observe this issue. One thing I already mentioned: we have a lot of ERROR logs with the wrong version. We are using Longhorn v1.5.1 and the engine image is correct, also v1.5.1, but the logs show:

...
time="2023-09-18T08:43:23Z" level=error msg="Dropping Longhorn replica longhorn-system/pvc-41a8caa6-6335-4065-842d-225e213d85a4-r-fb7c099d out of the queue" controller=longhorn-replica error="failed to sync replica for longhorn-system/pvc-41a8caa6-6335-4065-842d-225e213d85a4-r-fb7c099d: failed to cleanup the related replica instance before deleting replica pvc-41a8caa6-6335-4065-842d-225e213d85a4-r-fb7c099d: failed to get engine image object based on image name xyz.amazonaws.com/dok-cicd-registry/longhornio/longhorn-engine:v1.4.0: engineimage.longhorn.io \"ei-c5fd9691\" not found" node=xyz.compute.internal
...
time="2023-09-18T08:43:23Z" level=error msg="Error syncing Longhorn replica longhorn-system/pvc-41a8caa6-6335-4065-842d-225e213d85a4-r-208e44ad" controller=longhorn-replica error="failed to sync replica for longhorn-system/pvc-41a8caa6-6335-4065-842d-225e213d85a4-r-208e44ad: failed to cleanup the related replica instance before deleting replica pvc-41a8caa6-6335-4065-842d-225e213d85a4-r-208e44ad: failed to get engine image object based on image name xyz.amazonaws.com/dok-cicd-registry/longhornio/longhorn-engine:v1.4.0: engineimage.longhorn.io \"ei-c5fd9691\" not found" node=xyz.compute.internal
...

Regarding this log: we fixed it by removing the non-existent replicas (meaning their custom resources), so it is not a problem now; a sketch of the cleanup follows. The remaining problem is the data inconsistency.
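For reference, this is roughly how we cleaned up the stale replica custom resources (a sketch, assuming the v1.5 CRD layout where the replica spec carries an engineImage field; the replica name is a placeholder):

# List replica CRs together with the engine image each one references,
# to spot replicas still pointing at the removed v1.4.0 image.
kubectl -n longhorn-system get replicas.longhorn.io \
  -o custom-columns=NAME:.metadata.name,IMAGE:.spec.engineImage

# Delete a stale replica CR whose engine image object no longer exists.
kubectl -n longhorn-system delete replicas.longhorn.io <replica-name>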
