moby: docker service create doesn't work when network and generic-resource are both attached

Description

Issue

In Docker Swarm, --generic-resource does not work when it is used alongside --network. The cause is an inverted condition, if genericresource.HasResource(ta, available.Generic), in constraint_enforcer.go, which is evaluated when the service is brought up.

for _, ta := range t.AssignedGenericResources {
	// Type change or no longer available
	if genericresource.HasResource(ta, available.Generic) {
		removeTasks[t.ID] = t
		break loop
	}
}

The code should read if !genericresource.HasResource(ta, available.Generic) { so that the task which has an assigned and available GenericResource is not removed.
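
To illustrate why the negation matters, here is a minimal, self-contained sketch. It is not the swarmkit code: namedResource, hasResource, shouldRemoveBuggy and shouldRemoveFixed are hypothetical stand-ins, and hasResource only mimics the role that genericresource.HasResource plays in the snippet above.

```go
package main

import "fmt"

// namedResource is a hypothetical, simplified stand-in for a generic
// resource: a kind (for example "gpu") plus a concrete ID.
type namedResource struct {
	Kind string
	ID   string
}

// hasResource reports whether want is present in available. It only mimics
// the role of genericresource.HasResource; it is not the real implementation.
func hasResource(want namedResource, available []namedResource) bool {
	for _, r := range available {
		if r == want {
			return true
		}
	}
	return false
}

// shouldRemoveBuggy mirrors the condition quoted in the issue: the task is
// flagged for removal when an assigned resource IS available on the node.
func shouldRemoveBuggy(assigned, available []namedResource) bool {
	for _, ta := range assigned {
		if hasResource(ta, available) {
			return true
		}
	}
	return false
}

// shouldRemoveFixed applies the proposed negation: the task is flagged for
// removal only when an assigned resource is NOT available on the node.
func shouldRemoveFixed(assigned, available []namedResource) bool {
	for _, ta := range assigned {
		if !hasResource(ta, available) {
			return true
		}
	}
	return false
}

func main() {
	available := []namedResource{{Kind: "gpu", ID: "GPU-sample-id"}}
	assigned := []namedResource{{Kind: "gpu", ID: "GPU-sample-id"}}

	fmt.Println("buggy check removes task:", shouldRemoveBuggy(assigned, available))
	fmt.Println("fixed check removes task:", shouldRemoveFixed(assigned, available))
}
```

In this simplified model, the buggy check flags a task whose assigned resource is present on the node, which matches the rejection loop described in this report, while the fixed check leaves it alone.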

This bug is important because it prevents the use of generic resources in Docker Swarm, which is particularly relevant for assigning services to nodes based on GPU availability.

The generic-resources feature used to work properly in swarm in version 18.06.1.

Reproduce

Bug Investigation + Reproduction steps

This functionality was working in version 18.06.1 but not in any version afterwards.

Each release was tested with the steps below. The bug only occurs when the service is brought up on a non-manager (worker) swarm node:

  1. Initialize the swarm.
docker swarm init --advertise-addr <host_ip>
  2. Create an overlay network.
docker network create -d overlay --scope swarm test-network
  3. Modify /etc/docker/daemon.json on the worker to add an item to node-generic-resources, then restart the Docker service.
{
  ...
  "node-generic-resources": [
      "gpu_<type>=GPU-sample-id"
  ]
}
  4. Add a worker node to the swarm.

  5. Create a service with both the network and a generic-resource attached. This step fails: the service never reaches the Running state.

docker service create --network test-network --generic-resource "gpu_<type>=1" --name test-service quay.io/centos/centos:stream8 bash -c "env && sleep infinity"
  6. Observe the error by running docker service ps. A new task is created and then rejected in a loop, and the error does not resolve by itself.
docker service ps --no-trunc test-service

# Example output

ID                          NAME                 IMAGE                                  NODE                     DESIRED STATE   CURRENT STATE             ERROR                                              PORTS
4tqh3odt54qhp3mzumwjt3mpj   test-service.1       quay.io/centos/centos:stream8@<hash>   <worker_node_hostname>   Ready           Rejected 4 seconds ago    "assigned node no longer meets constraints"
j6d0qwxv2kglyzugpgd98qx1f    \_ test-service.1   quay.io/centos/centos:stream8@<hash>   <worker_node_hostname>   Shutdown        Rejected 9 seconds ago    "assigned node no longer meets constraints"
etrnb3per1lgej9wgp9az2u4d    \_ test-service.1   quay.io/centos/centos:stream8@<hash>   <worker_node_hostname>   Shutdown        Rejected 9 seconds ago    "assigned node no longer meets constraints"
inyz4ez2i95rd5vyxggo0mgbk    \_ test-service.1   quay.io/centos/centos:stream8@<hash>   <worker_node_hostname>   Shutdown        Rejected 14 seconds ago   "assigned node no longer meets constraints"
ymy14a55fbv6cgdaxl4nzy06h    \_ test-service.1   quay.io/centos/centos:stream8@<hash>   <worker_node_hostname>   Shutdown        Rejected 19 seconds ago   "assigned node no longer meets constraints"
  7. Remove the worker node from the swarm and repeat steps 4-6 for different versions of docker-ce.
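
When repeating the steps across several docker-ce versions, it can help to detect the rejection loop mechanically rather than by eye. The sketch below is one possible helper, not part of the original report. It assumes the task list was captured with an invocation along the lines of docker service ps --no-trunc --format '{{.CurrentState}}\t{{.Error}}' test-service, so that each line holds the current state and the error separated by a tab. That invocation, the function name countConstraintRejections, and the sample input are assumptions for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// constraintError is the error text shown in the example output of the report.
const constraintError = "assigned node no longer meets constraints"

// countConstraintRejections counts the tasks whose current state starts with
// "Rejected" and whose error contains constraintError. The input is assumed
// to hold one task per line in the form "<current state>\t<error>".
func countConstraintRejections(output string) int {
	count := 0
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		fields := strings.SplitN(scanner.Text(), "\t", 2)
		if len(fields) != 2 {
			continue // skip lines that do not match the assumed layout
		}
		state, errMsg := fields[0], fields[1]
		if strings.HasPrefix(state, "Rejected") && strings.Contains(errMsg, constraintError) {
			count++
		}
	}
	return count
}

func main() {
	// Hypothetical captured output: two rejected tasks and one running task.
	sample := "Rejected 4 seconds ago\t\"assigned node no longer meets constraints\"\n" +
		"Rejected 9 seconds ago\t\"assigned node no longer meets constraints\"\n" +
		"Running 2 minutes ago\t\n"

	fmt.Println("constraint rejections:", countConstraintRejections(sample))
}
```

A count that keeps growing between runs on an affected version would indicate the loop described in step 6; on a working version the count should stay at zero.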

Expected behavior

docker service create should create a service with a network and generic-resource attached.

docker version

Client: Docker Engine - Community
 Version:           20.10.17
 API version:       1.41
 Go version:        go1.17.11
 Git commit:        100c701
 Built:             Mon Jun  6 23:03:11 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.17
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.17.11
  Git commit:       a89b842
  Built:            Mon Jun  6 23:01:29 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.8
  GitCommit:        9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.9.1-docker)
  scan: Docker Scan (Docker Inc., v0.17.0)

Server:
 Containers: 28
  Running: 6
  Paused: 0
  Stopped: 22
 Images: 91
 Server Version: 20.10.18
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: local
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: <node_id>
  Is Manager: false
  Node Address: <node_address>
  Manager Addresses:
   <manager_1_address>
   <manager_2_address>
   <manager_3_address>
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: nvidia
 Init Binary: docker-init
 containerd version: 0197261a30bf81f1ee8e6a4dd2dea0ef95d67ccb
 runc version: v1.1.3-0-g6724737
 init version: de40ad0
 Kernel Version: <redacted>
 Operating System: <redacted>
 OSType: linux
 Architecture: x86_64
 CPUs: <redacted>
 Total Memory: <redacted>
 Name: <node_name>
 ID: <redacted>
 Debug Mode: false
 Experimental: false
 Live Restore Enabled: false

Additional Info

No response

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 23 (14 by maintainers)

Most upvoted comments

The fix for this should be included in 23.0.2.