moby: Tasks fail with error "Unable to complete atomic operation, key modified"

This is a spin-off of #35290 to address the error `Unable to complete atomic operation, key modified`, which prevents Tasks from being scheduled on the node.

Oct 26 09:43:13 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:13.534861035-07:00" level=error msg="fatal task error" error="Unable to complete atomic operation, key modified" module=node/agent/taskmanager node.id=mv4e72vawng0s82vvnb0tatc7 service.id=m9484e8pmri988v4n3bossl0x task.id=c1qap6z4nr50k9cacus63d8is

Thanks to @abhi, we reproduced the issue with debug enabled. The root cause appears to have been introduced via https://github.com/moby/moby/pull/34674

Looking at the attached logs,

````
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806458067-07:00" level=debug msg="releasing IPv4 pools from network orange (pz8tw8vw40ogxi01c3pcthldi)"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806483219-07:00" level=debug msg="ReleaseAddress(LocalDefault/10.0.4.0/24, 10.0.4.1)"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806504836-07:00" level=debug msg="ReleasePool(LocalDefault/10.0.4.0/24)"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806643570-07:00" level=debug msg="moby-demo-ubuntu-1(48d7c3d0f4a7): leaving network pz8tw8vw40ogxi01c3pcthldi"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806657694-07:00" level=debug msg="cleanupServiceDiscovery for network:pz8tw8vw40ogxi01c3pcthldi"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806665552-07:00" level=debug msg="cleanupServiceBindings for pz8tw8vw40ogxi01c3pcthldi"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806890107-07:00" level=warning msg="driver error deleting endpoint orange-endpoint : network id \"pz8tw8vw40ogxi01c3pcthldi\" not found"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806904850-07:00" level=debug msg="Releasing addresses for endpoint orange-endpoint's interface on network orange"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806915275-07:00" level=debug msg="ReleaseAddress(LocalDefault/10.0.4.0/24, 10.0.4.16)"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806935796-07:00" level=warning msg="Failed to release ip address 10.0.4.16 on delete of endpoint orange-endpoint (d68ba4cb8deea9ed20c05c0abbd19c4db3250824dbc9133afc5bf2977f8e4765): cannot find address pool for poolID:LocalDefault/10.0.4.0/24"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.806986817-07:00" level=error msg="network orange remove failed: unknown network orange id pz8tw8vw40ogxi01c3pcthldi" module=node/agent node.id=mv4e72vawng0s82vvnb0tatc7
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.807004206-07:00" level=error msg="remove task failed" error="unknown network orange id pz8tw8vw40ogxi01c3pcthldi" module=node/agent node.id=mv4e72vawng0s82vvnb0tatc7 task.id=d675fekf2hc5vxx188d5c0r5e
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.807146117-07:00" level=warning msg="driver error deleting endpoint orange-endpoint : network id \"pz8tw8vw40ogxi01c3pcthldi\" not found"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.807160203-07:00" level=debug msg="Releasing addresses for endpoint orange-endpoint's interface on network orange"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.807169786-07:00" level=debug msg="ReleaseAddress(LocalDefault/10.0.4.0/24, 10.0.4.16)"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.807182810-07:00" level=warning msg="Failed to release ip address 10.0.4.16 on delete of endpoint orange-endpoint (d68ba4cb8deea9ed20c05c0abbd19c4db3250824dbc9133afc5bf2977f8e4765): cannot find address pool for poolID:LocalDefault/10.0.4.0/24"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.809217709-07:00" level=error msg="network orange remove failed: unknown network orange id pz8tw8vw40ogxi01c3pcthldi" module=node/agent node.id=mv4e72vawng0s82vvnb0tatc7
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.809233253-07:00" level=error msg="remove task failed" error="unknown network orange id pz8tw8vw40ogxi01c3pcthldi" module=node/agent node.id=mv4e72vawng0s82vvnb0tatc7 task.id=qofrrwn7nf917tq9s93es6s8h
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.809383599-07:00" level=warning msg="driver error deleting endpoint orange-endpoint : network id \"pz8tw8vw40ogxi01c3pcthldi\" not found"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.809398355-07:00" level=debug msg="Releasing addresses for endpoint orange-endpoint's interface on network orange"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.809408880-07:00" level=debug msg="ReleaseAddress(LocalDefault/10.0.4.0/24, 10.0.4.16)"
Oct 26 09:43:11 moby-demo-ubuntu-1 dockerd[14670]: time="2017-10-26T09:43:11.809421881-07:00" level=warning msg="Failed to release ip address 10.0.4.16 on delete of endpoint orange-endpoint (d68ba4cb8deea9ed20c05c0abbd19c4db3250824dbc9133afc5bf2977f8e4765): cannot find address pool for poolID:LocalDefault/10.0.4.0/24"
````

It seems that even after the network is cleaned up on a worker node, cleanup of the load-balancer endpoint and sandbox for that network is retried continuously and keeps failing. As a side effect, the `endpoint_cnt` key is recreated in the store during the endpoint-cleanup phase. When the network is created again, the create fails because the key already exists.
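The "key modified" error is the signature of an optimistic-concurrency (compare-and-swap) write failing. Below is a minimal Go sketch of that failure mode, using hypothetical types (`kvPair`, `store`, `atomicPut`) rather than the real libkv/libnetwork ones: a stale `endpoint_cnt` key left behind by the failed cleanup makes the next atomic create fail.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrKeyModified mirrors the store error surfaced in the logs: a
// compare-and-swap write found the key at a different revision than
// the writer expected.
var ErrKeyModified = errors.New("Unable to complete atomic operation, key modified")

// kvPair is a simplified key/value record with a revision index,
// in the spirit of libkv-style stores (illustrative only).
type kvPair struct {
	value string
	index uint64
}

type store struct{ data map[string]kvPair }

// atomicPut succeeds only if the caller's lastIndex matches the stored
// revision; lastIndex 0 means "I expect the key not to exist yet".
func (s *store) atomicPut(key, value string, lastIndex uint64) error {
	cur, exists := s.data[key]
	if exists && cur.index != lastIndex {
		return ErrKeyModified
	}
	if !exists && lastIndex != 0 {
		return ErrKeyModified
	}
	s.data[key] = kvPair{value: value, index: lastIndex + 1}
	return nil
}

func main() {
	s := &store{data: map[string]kvPair{}}

	// A stale cleanup path recreates the endpoint_cnt key...
	_ = s.atomicPut("endpoint_cnt/orange", "0", 0)

	// ...so the subsequent network create, which expects the key to be
	// absent (lastIndex 0), fails with the error from this issue.
	err := s.atomicPut("endpoint_cnt/orange", "1", 0)
	fmt.Println(err) // Unable to complete atomic operation, key modified
}
```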

We could hack around `endpoint_cnt` to prevent this failure, but I think there is a fundamental issue with the way the per-network per-node LB endpoint and sandbox are managed. We must resolve this as soon as possible.

Should we back out #34674 until the issue is properly addressed? cc @pradipd @vieux @fcrisciani @abhi

daemon.log

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 10
  • Comments: 58 (20 by maintainers)

Most upvoted comments

This is resolved through https://github.com/moby/moby/pull/35422, which has been merged into master and will also be included in the next Docker release.

upd: refactored text, so it’s less confusing

I have updated the cluster to 17.11. Issues:

  1. After updating the Docker engine from 17.07 to 17.11 on all nodes, swarm mode was gone, and on 50% of the VMs ‘eth1’ had been renamed to ‘veth…’.

  2. After the swarm was recreated and the stack deployed (service endpoint mode is VIP), it ran stably for a while, but then services started to hang occasionally. The logs show timeouts connecting to an external LDAP server and from a reverse-proxy service to other services on remote nodes. This does not happen immediately, only after ~1h of running or more.

  3. An enormous number (10–20 per second) of dockerd[25583]: time="2017-11-24T11:03:14.756972678+01:00" level=info msg="Node join event ..." messages, which causes enormous I/O. Is there a way to switch this off or reduce it? I am not sure whether it is expected, as of #33962.

No “atomic operation failed” issue spotted so far. But the behavior in (3) is somewhat similar to what I saw on 17.10 with services in VIP mode, which forced me to switch to DNSRR back then.

troubleshooting update

An hour or so after restarting the services (or the Docker engine), the problems arise. Logs attached. @thaJeztah @mavenugo @fcrisciani Can you please help? … In addition, a restarted node occasionally comes up with no “eth1” (renamed to veth…), and I have to reboot again.

expand node-4 logs (labeled swarm node where the NGinx service is running)

````
Nov 25 13:23:24 node-4 dockerd: time="2017-11-25T13:23:24.575458206+01:00" level=error msg="agent: session failed" error="rpc error: code = Unavailable desc = grpc: the connection is unavailable" module=node/agent node.id=4v4391hgoi0ti7d36ob3kzemr
Nov 25 13:23:24 node-4 dockerd: time="2017-11-25T13:23:24.575512122+01:00" level=debug msg="agent: rebuild session" module=node/agent node.id=4v4391hgoi0ti7d36ob3kzemr
Nov 25 13:23:30 node-4 dockerd: time="2017-11-25T13:23:30.610107317+01:00" level=debug msg="(*session).start" module=node/agent node.id=4v4391hgoi0ti7d36ob3kzemr
Nov 25 13:23:30 node-4 dockerd: time="2017-11-25T13:23:30.610349023+01:00" level=error msg="agent: session failed" error="rpc error: code = Unavailable desc = grpc: the connection is unavailable" module=node/agent node.id=4v4391hgoi0ti7d36ob3kzemr
Nov 25 13:23:30 node-4 dockerd: time="2017-11-25T13:23:30.610454836+01:00" level=debug msg="agent: rebuild session" module=node/agent node.id=4v4391hgoi0ti7d36ob3kzemr
````
expand node-4 debug logs (fdb entry not found)

````
Nov 25 13:54:16 node-4 dockerd: time="2017-11-25T13:54:16.695011960+01:00" level=debug msg="handleEpTableEvent DEL 461c9da05ea4748057cc9db05bea488875c849c237f447de131210df0390c281 R:{stackname-endpoint <nil> 110.10.0.47 [] [] []}"
Nov 25 13:54:16 node-4 dockerd: time="2017-11-25T13:54:16.695145464+01:00" level=debug msg="delContainerNameResolution 461c9da05ea4748057cc9db05bea488875c849c237f447de131210df0390c281 stackname-endpoint"
Nov 25 13:54:16 node-4 dockerd: time="2017-11-25T13:54:16.695103863+01:00" level=warning msg="Entry was not in db: nid:rf220k3jwjbc3bqrwe8whjob7 eid:461c9da05ea4748057cc9db05bea488875c849c237f447de131210df0390c281 peerIP:110.10.0.47 peerMac:02:42:6e:0a:00:2f isLocal:false vtep:16.23.32.10"
Nov 25 13:54:16 node-4 dockerd: time="2017-11-25T13:54:16.695178936+01:00" level=debug msg="461c9da05ea4748057cc9db05bea488875c849c237f447de131210df0390c281 (rf220k3).deleteSvcRecords(stackname-endpoint, 110.10.0.47, <nil>, true) handleEpTableEvent sid:461c9da05ea4748057cc9db05bea488875c849c237f447de131210df0390c281 "
Nov 25 13:54:16 node-4 dockerd: time="2017-11-25T13:54:16.695201449+01:00" level=debug msg="checkEncryption(rf220k3, 16.23.32.10, 0, false)"
Nov 25 13:54:16 node-4 dockerd: time="2017-11-25T13:54:16.695276836+01:00" level=warning msg="Peer operation failed:could not delete fdb entry for nid:rf220k3jwjbc3bqrwe8whjob7 eid:461c9da05ea4748057cc9db05bea488875c849c237f447de131210df0390c281 into the sandbox:Search neighbor failed for IP 16.23.32.10, mac 02:42:6e:0a:00:2f, present in db:false op:&{2 rf220k3jwjbc3bqrwe8whjob7 461c9da05ea4748057cc9db05bea488875c849c237f447de131210df0390c281 [0 0 0 0 0 0 0 0 0 0 255 255 110 10 0 47] [255 255 0 0] [2 66 110 10 0 47] [0 0 0 0 0 0 0 0 0 0 255 255 172 23 32 210] false false false EventNotify}"
````
expand node-1 (master) debug logs

````
Nov 25 18:24:22 node-1 dockerd: time="2017-11-25T18:24:22.982259281+01:00" level=debug msg="dispatcher committed status update to store" method="(*Dispatcher).processUpdates" module=dispatcher node.id=grdevrj2b2xqtal5qjmmd1evs state.transition="ASSIGNED->PREPARING" task.id=xtrtk2dkbq4xd5o9tgmttrayy
Nov 25 18:24:22 node-1 dockerd: time="2017-11-25T18:24:22.982543082+01:00" level=debug msg="dispatcher committed status update to store" method="(*Dispatcher).processUpdates" module=dispatcher node.id=grdevrj2b2xqtal5qjmmd1evs state.transition="FAILED->FAILED" task.id=xmjxiy9ad4f8zgg7ly53lvqoq
````

@parrisvarney thank you for the workaround but I must mention this issue occurs on totally new swarm cluster with docker stack deploy from scratch and redeploy or additional services can cause it again. Removing stack and readding is good for a temporary workaround in development environment only.

We’re still looking at some fixes we’d like to get in before releasing; hopefully soon.

Looks good so far

  • created an overlay network proxy
  • deployed docker flow proxy stack
  • deployed swarm visualizer stack
  • scale proxy to 3

[screenshot from 2017-11-23 10:33]

Just spinning up a test cluster…will report when done

According to the release notes, it did: https://github.com/docker/docker-ce/releases/tag/v17.11.0-ce

Fix tasks fail with error “Unable to complete atomic operation, key modified” docker/libnetwork#2004

Haven’t checked it on my clusters yet.

If anybody is stuck, I’ve had success working around this issue by deleting the stack and network, then recreating both.

````
docker stack rm my_stack
docker network rm my_network
docker network create --driver overlay my_network
docker stack deploy my_stack
````

@vieux
Making progress.
I can’t repro at all on Windows, but I can repro this 100% on Linux, which seems odd if this is a timing issue. The issue repros for me with the following:

````
docker swarm init --advertise-addr=<ip>
docker network create -d overlay --subnet 192.168.1.0/24 --gateway 192.168.1.1 o1
docker service create --name s1 --replicas 3 --network o1 --publish 8080:80 nginx
docker service rm s1
docker service create --name s1 --replicas 3 --network o1 --publish 8080:80 nginx
````

I’m a little unsure how I didn’t see this before and/or why the tests didn’t catch it.
On the bright side, with a fairly consistent repro, I should be able to fix.

I confirm it’s working with “Docker version 17.11.0-ce, build 1caf76c” - thank you.

@pradipd we will release 17.11 very soon, and 17.10 was only an edge release, so I think fixing it in 17.11 is enough? Do you have a timeline for the fix? (RC3 is tentatively scheduled for Monday)

Synced up with @mavenugo and we have a fix. However, we will bring the fix into 17.11. I think the plan is to revert this change in 17.10. @mavenugo @vieux please confirm that is the plan.