moby: Unused CSI Volumes stay in use if services using them are removed in the wrong order
Description
CSI volumes have an issue around state transitions that lets them stay “in use” if a service using them is removed without the volume being drained first. The volume can then no longer be removed without -f.
This behaviour was observed with the Hetzner Cloud CSI driver and with democratic-csi local-hostpath.
This does not work:
martinb@ubuntu:~/csi-plugins-for-docker-swarm/democratic-csi/local-hostpath$ docker volume create --driver neuroforgede/swarm-csi-local-path:v1.8.3 --availability active --scope single --sharing none --type mount my-csi-local-volume
my-csi-local-volume
martinb@ubuntu:~/csi-plugins-for-docker-swarm/democratic-csi/local-hostpath$ docker service create --name my-service --mount type=cluster,src=my-csi-local-volume,dst=/usr/share/nginx/html --publish 8080:80 nginx
boh6bvmtxoq8an04jqkx2ramr
overall progress: 1 out of 1 tasks
1/1: running [==================================================>]
verify: Service converged
martinb@ubuntu:~/csi-plugins-for-docker-swarm/democratic-csi/local-hostpath$ docker service rm my-service
my-service
martinb@ubuntu:~/csi-plugins-for-docker-swarm/democratic-csi/local-hostpath$ docker volume update --availability drain my-csi-local-volume
martinb@ubuntu:~/csi-plugins-for-docker-swarm/democratic-csi/local-hostpath$ docker volume ls --cluster
VOLUME NAME GROUP DRIVER AVAILABILITY STATUS
my-csi-local-volume neuroforgede/swarm-csi-local-path:v1.8.3 drain in use (1 node)
martinb@ubuntu:~/csi-plugins-for-docker-swarm/democratic-csi/local-hostpath$ docker volume update --availability active my-csi-local-volume
martinb@ubuntu:~/csi-plugins-for-docker-swarm/democratic-csi/local-hostpath$ docker volume ls --cluster
VOLUME NAME GROUP DRIVER AVAILABILITY STATUS
my-csi-local-volume neuroforgede/swarm-csi-local-path:v1.8.3 active in use (1 node)
martinb@ubuntu:~/csi-plugins-for-docker-swarm/democratic-csi/local-hostpath$ docker volume rm my-csi-local-volume
Error response from daemon: rpc error: code = FailedPrecondition desc = volume is still in use
martinb@ubuntu:~/csi-plugins-for-docker-swarm/democratic-csi/local-hostpath$
This works:
martinb@ubuntu:~/csi-plugins-for-docker-swarm/democratic-csi/local-hostpath$ docker volume create --driver neuroforgede/swarm-csi-local-path:v1.8.3 --availability active --scope single --sharing none --type mount my-csi-local-volume
my-csi-local-volume
martinb@ubuntu:~/csi-plugins-for-docker-swarm/democratic-csi/local-hostpath$ docker service create --name my-service --mount type=cluster,src=my-csi-local-volume,dst=/usr/share/nginx/html --publish 8080:80 nginx
ngp9coiy14wresuseasgzri7v
overall progress: 1 out of 1 tasks
1/1: running [==================================================>]
verify: Service converged
martinb@ubuntu:~/csi-plugins-for-docker-swarm/democratic-csi/local-hostpath$ docker volume update --availability drain my-csi-local-volume
martinb@ubuntu:~/csi-plugins-for-docker-swarm/democratic-csi/local-hostpath$ docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
ngp9coiy14wr my-service replicated 0/1 nginx:latest *:8080->80/tcp
martinb@ubuntu:~/csi-plugins-for-docker-swarm/democratic-csi/local-hostpath$ docker service rm my-service
my-service
martinb@ubuntu:~/csi-plugins-for-docker-swarm/democratic-csi/local-hostpath$ docker volume ls --cluster
VOLUME NAME GROUP DRIVER AVAILABILITY STATUS
my-csi-local-volume neuroforgede/swarm-csi-local-path:v1.8.3 drain in use (1 node)
martinb@ubuntu:~/csi-plugins-for-docker-swarm/democratic-csi/local-hostpath$ docker volume update --availability active my-csi-local-volume
martinb@ubuntu:~/csi-plugins-for-docker-swarm/democratic-csi/local-hostpath$ docker volume ls --cluster
VOLUME NAME GROUP DRIVER AVAILABILITY STATUS
my-csi-local-volume neuroforgede/swarm-csi-local-path:v1.8.3 active created
martinb@ubuntu:~/csi-plugins-for-docker-swarm/democratic-csi/local-hostpath$ docker volume rm my-csi-local-volume
my-csi-local-volume
martinb@ubuntu:~/csi-plugins-for-docker-swarm/democratic-csi/local-hostpath$ docker volume ls --cluster
VOLUME NAME GROUP DRIVER AVAILABILITY STATUS
martinb@ubuntu:~/csi-plugins-for-docker-swarm/democratic-csi/local-hostpath$
Note that even in this case, the status change from “in use” to “created” is only triggered by the availability update, which leads me to believe that there are some state transition events that we are missing.
Reproduce
- docker volume create --driver neuroforgede/swarm-csi-local-path:v1.8.3 --availability active --scope single --sharing none --type mount my-csi-local-volume
- docker volume ls --cluster
- docker service create --name my-service --mount type=cluster,src=my-csi-local-volume,dst=/usr/share/nginx/html --publish 8080:80 nginx
- docker service rm my-service
- wait 1 minute to see if something changes in docker volume ls --cluster
- docker volume update --availability drain my-csi-local-volume
- wait 1 minute to see if something changes in docker volume ls --cluster
Expected behavior
Unused CSI volumes should automatically switch from “in use” to “created”.
docker version
Client: Docker Engine - Community
Version: 23.0.6
API version: 1.42
Go version: go1.19.9
Git commit: ef23cbc
Built: Fri May 5 21:18:22 2023
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 23.0.6
API version: 1.42 (minimum version 1.12)
Go version: go1.19.9
Git commit: 9dbdbd4
Built: Fri May 5 21:18:22 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.21
GitCommit: 3dce8eb055cbb6872793272b4f20ed16117344f8
runc:
Version: 1.1.7
GitCommit: v1.1.7-0-g860f061
docker-init:
Version: 0.19.0
GitCommit: de40ad0
docker info
Client:
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.10.4
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.17.3
Path: /usr/libexec/docker/cli-plugins/docker-compose
scan: Docker Scan (Docker Inc.)
Version: v0.23.0
Path: /usr/libexec/docker/cli-plugins/docker-scan
Server:
Containers: 66
Running: 10
Paused: 0
Stopped: 56
Images: 109
Server Version: 23.0.6
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
NodeID: jnluxx8q8ghcla3zq2obm3ro6
Is Manager: true
ClusterID: vowmrzc70hmfksc5kfj9hckpf
Managers: 1
Nodes: 1
Default Address Pool: 10.0.0.0/8
SubnetSize: 24
Data Path Port: 4789
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 10
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 192.168.116.129
Manager Addresses:
192.168.116.129:2377
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 3dce8eb055cbb6872793272b4f20ed16117344f8
runc version: v1.1.7-0-g860f061
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
Kernel Version: 5.15.0-71-generic
Operating System: Ubuntu 20.04.6 LTS
OSType: linux
Architecture: x86_64
CPUs: 24
Total Memory: 38.29GiB
Name: ubuntu
ID: 754B:FW3A:TJ3E:V5NJ:63QM:6HAQ:J4KL:K7JK:3XYY:CPCB:OUZK:XFWI
Docker Root Dir: /var/lib/docker
Debug Mode: false
Username: ancieque
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Additional Info
The workaround of draining the volume before removing the service was originally discovered by @sidpalas with the Hetzner Cloud CSI driver.
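For reference, here is a minimal sketch (not an official tool) that automates that workaround by calling the docker CLI from Go; the volume and service names are the ones from the transcripts above and are assumptions for any other setup:

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// docker runs a single docker CLI command and aborts on failure.
func docker(args ...string) {
	out, err := exec.Command("docker", args...).CombinedOutput()
	fmt.Print(string(out))
	if err != nil {
		log.Fatalf("docker %v failed: %v", args, err)
	}
}

func main() {
	const volume = "my-csi-local-volume"
	const service = "my-service"

	// 1. Drain the volume while the service still exists; this shuts down
	//    the tasks using it (the service drops to 0/1 replicas).
	docker("volume", "update", "--availability", "drain", volume)

	// 2. Remove the service once its tasks are gone.
	docker("service", "rm", service)

	// 3. Reactivate the volume; in the transcript above this is the step
	//    that flips the status from "in use" to "created".
	docker("volume", "update", "--availability", "active", volume)

	// 4. Now the volume can be removed without -f.
	docker("volume", "rm", volume)
}
```

The essential part is the ordering: per the transcripts above, the drain has to happen before the service is removed, otherwise the volume stays “in use”.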
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 16 (13 by maintainers)
It’s not vendored in moby/moby master yet, so I doubt it
see how the code is missing here: https://github.com/moby/moby/blob/master/vendor/github.com/moby/swarmkit/v2/manager/scheduler/scheduler.go
So I’d say we wait for Drew to weigh in, but from the code changes I read in that PR, it looks like this has been fixed there.
I see the problem now. The way freeing volumes works, we look for volumes to remove each time we do a pass of the scheduler. Deleting a Service doesn’t cause a scheduling pass, so we don’t free the Volumes. In theory, some other scheduling event, like creating a new service, ought to cause the scheduling pass that successfully frees the Volume? I need to check…
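To make that control flow concrete, here is a deliberately simplified, self-contained Go sketch; none of these types or functions exist in moby/swarmkit, they only illustrate the described behaviour that unused volumes are freed inside a scheduling pass and that removing a service does not trigger one:

```go
package main

import "fmt"

// cluster is a toy model: which volumes are still used by tasks, and the
// status string that `docker volume ls --cluster` would show for them.
type cluster struct {
	tasksUsingVolume map[string]int
	volumeStatus     map[string]string
}

// schedulingPass is the only place where unused volumes get freed.
func (c *cluster) schedulingPass() {
	for vol, tasks := range c.tasksUsingVolume {
		if tasks == 0 {
			c.volumeStatus[vol] = "created"
		}
	}
}

func (c *cluster) handle(kind string) {
	switch kind {
	case "service-removed":
		// Tasks shut down, but no scheduling pass runs here, so the
		// volume status is never re-evaluated.
		for vol := range c.tasksUsingVolume {
			c.tasksUsingVolume[vol] = 0
		}
	case "service-created", "node-updated":
		// Only events that require placing tasks run the scheduler.
		c.schedulingPass()
	}
}

func main() {
	c := &cluster{
		tasksUsingVolume: map[string]int{"my-csi-local-volume": 1},
		volumeStatus:     map[string]string{"my-csi-local-volume": "in use (1 node)"},
	}

	c.handle("service-removed")
	fmt.Println(c.volumeStatus["my-csi-local-volume"]) // still "in use (1 node)"

	c.handle("service-created") // any later scheduling event
	fmt.Println(c.volumeStatus["my-csi-local-volume"]) // now "created"
}
```

Under this model, any later event that does run a scheduling pass (such as creating an unrelated service) would free the volume, which matches the hypothesis above.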
One big weak point in CSI, and by extension Swarm’s use of it, is that the state transitions of a CSI volume are strict. We cannot Unpublish a Volume on the Controller side until it has been Unpublished and Unstaged on the Node. To adhere to this restriction, we must get an affirmative signal from the Swarm Agent that a Volume has been successfully Unstaged.
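As an illustration of that ordering constraint, here is a hedged Go sketch; the interfaces are hypothetical stand-ins, while the underlying operations are the RPCs defined by the CSI spec (NodeUnpublishVolume, NodeUnstageVolume, ControllerUnpublishVolume):

```go
package main

import "fmt"

// Hypothetical stand-ins for the node- and controller-side CSI operations.
type nodePlugin interface {
	NodeUnpublish(volumeID string) error
	NodeUnstage(volumeID string) error
}

type controllerPlugin interface {
	ControllerUnpublish(volumeID, nodeID string) error
}

// releaseVolume may only call the controller-side unpublish after the node has
// positively confirmed both unpublish and unstage. If that confirmation never
// arrives (node crash, network partition), the volume stays "in use".
func releaseVolume(node nodePlugin, ctrl controllerPlugin, volumeID, nodeID string) error {
	if err := node.NodeUnpublish(volumeID); err != nil {
		return fmt.Errorf("node unpublish not confirmed: %w", err)
	}
	if err := node.NodeUnstage(volumeID); err != nil {
		return fmt.Errorf("node unstage not confirmed: %w", err)
	}
	// Only at this point does the CSI spec allow the controller unpublish.
	return ctrl.ControllerUnpublish(volumeID, nodeID)
}

// Fakes so the sketch runs standalone.
type fakeNode struct{}

func (fakeNode) NodeUnpublish(string) error { return nil }
func (fakeNode) NodeUnstage(string) error   { return nil }

type fakeController struct{}

func (fakeController) ControllerUnpublish(string, string) error { return nil }

func main() {
	err := releaseVolume(fakeNode{}, fakeController{}, "my-csi-local-volume", "node-1")
	fmt.Println("released:", err == nil)
}
```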
If something goes wrong in the Unstage process, the Volume will be stuck “In Use”, possibly forever. This could happen if the Node is struck by lightning or falls through a crack in reality into the great nothingness between worlds. I’m unsure how Kubernetes handles such a case.
This is present across plugins, though, which makes me strongly suspect the problem is in Swarm’s implementation.
cc @dperny