longhorn: [BUG] Can not delete type=`bi` VolumeSnapshot if related backing image not exist

Describe the bug (๐Ÿ› if you encounter this issue)

Delete type=bi VolumeSnapshot will stuck if backing image not exist

To Reproduce

Steps to reproduce the behavior:

  1. Deploy the CSI snapshot CRDs and related Controller v6.2.1

  2. Deploy Longhorn

  3. Create VolumeSnapshotClass longhorn-snapshot-vsc

    manifest
         kind: VolumeSnapshotClass
         apiVersion: snapshot.storage.k8s.io/v1
         metadata:
             name: longhorn-snapshot-vsc
         driver: driver.longhorn.io
         deletionPolicy: Delete
         parameters:
             type: bi  
    
  1. Create VolumeSnapshotContent test-on-demand-backing

    manifest
       apiVersion: snapshot.storage.k8s.io/v1
       kind: VolumeSnapshotContent
       metadata:
           name: test-on-demand-backing
       spec:
           volumeSnapshotClassName: longhorn-snapshot-vsc
           driver: driver.longhorn.io
           deletionPolicy: Delete
           source:
               snapshotHandle: bi://backing?backingImageDataSourceType=download&backingImage=test-bi&url=https%3A%2F%2Flonghorn-backing-image.s3-us-west-1.amazonaws.com%2Fparrot.qcow2&backingImageChecksum=bd79ab9e6d45abf4f3f0adf552a868074dd235c4698ce7258d521160e0ad79ffe555b94e7d4007add6e1a25f4526885eb25c53ce38f7d344dd4925b9f2cb5d3b  # NOQA
           volumeSnapshotRef:
               name: test-snapshot-on-demand-backing
               namespace: default
    
  2. Create VolumeSnapshot test-snapshot-on-demand-backing

    manifest
       apiVersion: snapshot.storage.k8s.io/v1
       kind: VolumeSnapshot
       metadata:
           name: test-snapshot-on-demand-backing
       spec:
           volumeSnapshotClassName: longhorn-snapshot-vsc
           source:
               volumeSnapshotContentName: test-on-demand-backing
    
  3. Delete VolumeSnapshot test-snapshot-on-demand-backing, the deletion stuck. (If Longhorn had a backing image named test-bi before execute step 6, the delete will success.)

Expected behavior

Delete should success no matter backing image exist

Log or Support bundle

longhorn-manager

time="2023-07-10T02:03:00Z" level=warning msg="HTTP handling error" error="failed to get backing image 'test-bi': backingimage.longhorn.io \"test-bi\" not found"
time="2023-07-10T02:03:00Z" level=error msg="Error in request: failed to get backing image 'test-bi': backingimage.longhorn.io \"test-bi\" not found"
10.42.0.31 - - [10/Jul/2023:02:03:00 +0000] "GET /v1/backingimages/test-bi HTTP/1.1" 500 248 "" "Go-http-client/1.1"
time="2023-07-10T02:05:08Z" level=warning msg="HTTP handling error" error="failed to get backing image 'test-bi': backingimage.longhorn.io \"test-bi\" not found"
time="2023-07-10T02:05:08Z" level=error msg="Error in request: failed to get backing image 'test-bi': backingimage.longhorn.io \"test-bi\" not found"

csi-snapshotter

E0710 01:55:06.791510       1 snapshot_controller_base.go:283] could not sync content "test-on-demand-backing": failed to delete snapshot "test-on-demand-backing", err: failed to delete snapshot content test-on-demand-backing: "rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [detail=, message=failed to get backing image 'test-bi': backingimage.longhorn.io \"test-bi\" not found, code=Server Error] from [http://longhorn-backend:9500/v1/backingimages/test-bi]"

supportbundle_f16395ed-43c1-480c-bbb3-eec929c91479_2023-07-10T02-07-09Z.zip

Environment

  • Longhorn version: master, v1.5.x, v1.4.x
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.27.3+k3s1
  • Node config
    • OS type and version: Ubuntu 22.04
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS

Additional context

https://github.com/longhorn/longhorn/issues/6231

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 23 (22 by maintainers)

Most upvoted comments

Oh I got it Yeah

@innobead return 404 not found is api server implementation from client it is quite hard to analyze the error, unless we check the error message, but then we need to unify the error message from server Currently I donโ€™t have any idea for this middle layer cc @shuo-wu

@ChanYiLin What I mean is not client side but server side, like add your change in the below code in general to handle for all endpoint functions. WDYT?

/api/router.go#L16-L24

func HandleError(s *client.Schemas, t HandleFuncWithError) http.Handler {
	return api.ApiHandler(s, http.HandlerFunc(func(rw http.ResponseWriter, req *http.Request) {
		if err := t(rw, req); err != nil {
                    logrus.WithError(err).Warnf("HTTP handling error")  
                    
                    if datastore.ErrorIsNotFound(err) {
		        rw.WriteHeader(http.StatusNotFound)
                        return nil
		    }

		    apiContext := api.GetApiContext(req)
		    apiContext.WriteErr(err)
		}
	}))
}

I think this solution is good enough for current issue and give the right status code (404) for all not found error For other errors, as you said, we are currently restricted by the lib.

I think we can create another issue for handling other errors

@ChanYiLin It seems PRs merged. Please remember to move it to ready for testing if all done.

@ChanYiLin Sounds good, please help create the issue to track it later by self-assigning, also I am not sure if the go-rancher has any new version w/ a related improvement. or not or even we can contribute back to it.

@innobead return 404 not found is api server implementation from client it is quite hard to analyze the error, unless we check the error message, but then we need to unify the error message from server

Currently I donโ€™t have any idea for this middle layer

cc @shuo-wu

@ChanYiLin What I mean is not client side but server side, like add your change in the below code in general to handle for all endpoint functions. WDYT?

/api/router.go#L16-L24

func HandleError(s *client.Schemas, t HandleFuncWithError) http.Handler {
	return api.ApiHandler(s, http.HandlerFunc(func(rw http.ResponseWriter, req *http.Request) {
		if err := t(rw, req); err != nil {
                    logrus.WithError(err).Warnf("HTTP handling error")  
                    
                    if datastore.ErrorIsNotFound(err) {
		        rw.WriteHeader(http.StatusNotFound)
                        return nil
		    }

		    apiContext := api.GetApiContext(req)
		    apiContext.WriteErr(err)
		}
	}))
}

Hi @innobead there is no need to backport to 1.3.X and 1.4.X since backing image with csi feature is only introduced in v1.5.x thanks!

Hi @innobead ,

Sorry I tested on v1.4.3-rc1 this morning, VolumeSnpshotClass, VolumeSnapshotContent, VolumeSnapshot, can create without problem, but I just notice that create PVC and backing image from VolumeSnapshot not support, I am sorry for the misleading information.

Hi @innobead there is no need to backport to 1.3.X and 1.4.X since backing image with csi feature is only introduced in v1.5.x thanks!

It seems backup type snapshot will have the same issue If someone delete the backupVolume which is created by the volumesnapshot through UI Then one canโ€™t delete the volumesnapshot after as well

Oh I found the reason why it failed

  • In csi, before deleting the backing image, it will get the backing image through api (not datastore) in generated_client
func (c *BackingImageClient) ById(id string) (*BackingImage, error) {
	resp := &BackingImage{}
	err := c.rancherClient.doById(BACKING_IMAGE_TYPE, id, resp)
	if apiError, ok := err.(*ApiError); ok {
		if apiError.StatusCode == 404 {
			return nil, nil
		}
	}
	return resp, err
}

we do check if it is not found (404) and return nil, and should bypass the deletion process

  • However, in server part, we raise 500 error with backing image not found
func (s *Server) BackingImageGet(rw http.ResponseWriter, req *http.Request) error {
	apiContext := api.GetApiContext(req)

	id := mux.Vars(req)["name"]

	bi, err := s.m.GetBackingImage(id)
	if err != nil {
		return errors.Wrapf(err, "failed to get backing image '%s'", id)
	}

	apiContext.Write(toBackingImageResource(bi, apiContext))
	return nil
}

So the csi failed to process because of the error.

Removed from 1.5.1 and 1.4.3 which will be released in the near future, so this can be tackled in the following release instead.