portainer: Too many open files

Bug description

I am running a Docker swarm with 7 nodes (3 managers and 4 workers). I have deployed Portainer via the stack file, with the agents and Portainer itself. After some time an agent instance will report high CPU usage (60-70%) and the Portainer instance will not load any data about services, stacks, etc.
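For reference, the agent logs can be pulled from any manager node with the standard Docker CLI. This assumes the default stack name portainer from the official stack file, so the agent service is named portainer_agent:

docker service ps portainer_agent                  # shows which node each agent task runs on
docker service logs --tail 100 portainer_agent     # recent log output from all agent tasks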

When I look at the log files for the agent with the high CPU I get:

2018/06/22 13:48:35 [ERR] memberlist: Error accepting TCP connection: accept tcp [::]:7946: accept4: too many open files
2018/06/22 13:48:35 [ERR] memberlist: Error accepting TCP connection: accept tcp [::]:7946: accept4: too many open files
2018/06/22 13:48:35 [ERR] memberlist: Error accepting TCP connection: accept tcp [::]:7946: accept4: too many open files
(the same line repeats continuously)
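To confirm that the agent process is actually hitting its file descriptor limit, the current count can be compared with the limit on the affected node. A rough check, assuming a single agent container whose name contains “agent”:

pid=$(docker inspect --format '{{.State.Pid}}' $(docker ps -q --filter name=agent))
ls /proc/$pid/fd | wc -l                   # descriptors currently open
grep 'Max open files' /proc/$pid/limits    # limit the process is running with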

Expected behavior

I would expect the Portainer instance to load all the data it can, so I can continue to use it as much as possible. An error message explaining that it was unable to contact an agent would also be helpful.

Steps to reproduce the issue:

I do not have any specific steps to reproduce it. It just seems to happen randomly.

Technical details:

  • Portainer version: 1.18.0
  • Docker version (managed by Portainer): 18.03.1-ce
  • Platform (windows/linux): Amazon Linux AMI release 2018.03
  • Command used to start Portainer (a quick check of the deployed stack is sketched after this list):
    sudo curl -L https://portainer.io/download/portainer-agent-stack.yml -o portainer-agent-stack.yml
    sudo docker stack deploy --compose-file=portainer-agent-stack.yml portainer
  • Browser: Firefox 60.0.2 (64-bit)
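For completeness, a minimal way to check that the stack deployed above came up as expected (service names assume the default stack name portainer):

docker stack services portainer    # should list the agent service (global mode) and the Portainer service with all replicas running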

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 47 (18 by maintainers)

Most upvoted comments

@deviantony Portainer seems to be running much more stably with --no-snapshot added. It’s now been up for 4 days without any issues!
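In case anyone else wants to try the same isolation step, one way to add the flag without editing the stack file is to update the Portainer service’s arguments. This is only a sketch: the service name portainer_portainer comes from the default stack, and --args replaces all existing arguments, so keep whatever your stack file already passes (e.g. the -H tcp://tasks.agent:9001 endpoint):

docker service update --args "-H tcp://tasks.agent:9001 --no-snapshot" portainer_portainer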

Closing via https://github.com/portainer/agent/issues/43

Will be part of the agent version 1.5.0.

I can confirm the dashboard reloading and fd count going up behavior. I created a Vagrant + Ansible + Amazon Linux 2 image setup to reproduce this: v.zip

  1. Unzip the three files in the attachment to a single folder and run vagrant up there.
  2. This provisions two VirtualBox nodes. Note: there should be one Bridged Network adapter attached to each node (portnode1=081027D43801, portnode2=081027D43802). For some reason (a Vagrant bug?) I sometimes got node2 with two Bridged adapters instead. If that happens, stop node2, remove the extra network adapter, and start it back up again.
  3. On the first run Ansible should print a debug message with the Portainer port and details. You can re-run it later with vagrant provision.
  4. vagrant ssh portnode1 to get to the node 1 shell and run:
sudo su
export p=$(docker inspect --format '{{.State.Pid}}' $(docker ps | grep agent | cut -c-12))
yes 'ls /proc/$p/fd | wc -l; sleep 2' | /usr/bin/sh

This prints the number of fds the agent process has open every 2 seconds.

  5. Go to the Portainer web UI at portnode2:9000/#/dashboard and click between Dashboard, Stacks and back, etc. The number of fds goes up by approximately 9 on each refresh.
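If you want to see what kind of descriptors are accumulating rather than just the count, the fd symlinks can be listed directly (reusing the same $p as in step 4):

ls -l /proc/$p/fd | awk '{print $NF}' | cut -d: -f1 | sort | uniq -c | sort -rn    # groups the targets, e.g. socket, pipe, /dev/null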

Note:

  • It seems at least two nodes are required. I couldn’t reproduce this while using a single node only.
  • Ansible 2.8.0+ (dev version) is required because of the Docker modules

@tle211212

I also encountered this bug. My setup is a swarm cluster with 2 nodes (1 manager and 1 worker). I observe that each time I load or refresh the dashboard page (i.e. https://abc.yyy/portainer/#/dashboard), the Portainer agent containers have an increasing number of open files (/proc/<agent PID>/fd). This leak also happens on previous versions: 1.18.1, 1.19.1, 1.19.2 and 1.20.0.
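If it helps narrow this down, one way to check whether the leaked descriptors are TCP connections the agent never closes is to look inside its network namespace from the host. A rough sketch, assuming nsenter is available and the agent container name contains “agent”:

pid=$(docker inspect --format '{{.State.Pid}}' $(docker ps -q --filter name=agent))
nsenter -t "$pid" -n ss -tan state close-wait | tail -n +2 | wc -l    # connections stuck half-closed inside the agent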

I can’t reproduce this one. I have an agent deployed locally and another on a swarm, and the file descriptor count stays the same on both (I tried refreshing the dashboard, changing pages…).

The --no-snapshot flag is not a solution here; it was just used to isolate the cause of the problem.

I believe that #2235 solves this issue; I’m waiting for some feedback from other users before we decide to merge it.