moby: docker suddenly stops working in swarm mode due to high memory consumption

Description

I have 3 managers in my swarm: dmgr-01, dmgr-02 and dmgr-03. I always run all commands on dmgr-01, and it was the swarm leader. But today I got the following output from the docker service ls command:

runtime/cgo: pthread_create failed: Resource temporarily unavailable                         
SIGABRT: abort                                                                               
PC=0x7f7bc6596067 m=0                                                                        
                                                                                             
goroutine 0 [idle]:                                                                          
                                                                                             
goroutine 1 [running]:                                                                       
runtime.systemstack_switch()                                                                 
        /usr/local/go/src/runtime/asm_amd64.s:252 fp=0xc420020768 sp=0xc420020760            
runtime.main()                                                                               
        /usr/local/go/src/runtime/proc.go:127 +0x6c fp=0xc4200207c0 sp=0xc420020768          
runtime.goexit()                                                                             
        /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc4200207c8 sp=0xc4200207c0      
                                                                                             
goroutine 17 [syscall, locked to thread]:                                                    
runtime.goexit()                                                                             
        /usr/local/go/src/runtime/asm_amd64.s:2086 +0x1                                      
                                                                                             
rax    0x0                                                                                   
rbx    0x7f7bc6907708                                                                        
rcx    0xffffffffffffffff                                                                    
rdx    0x6                                                                                   
rdi    0x3daf                                                                                
rsi    0x3daf                                                                                
rbp    0xc5a91e                                                                              
rsp    0x7ffe234fb208                                                                        
r8     0xa                                                                                   
r9     0x7f7bc6f47740                                                                        
r10    0x8                                                                                   
r11    0x206                                                                                 
r12    0x1bcd050                                                                             
r13    0xf3                                                                                  
r14    0x30                                                                                  
r15    0x3                                                                                   
rip    0x7f7bc6596067                                                                        
rflags 0x206                                                                                 
cs     0x33                                                                                  
fs     0x0                                                                                   
gs     0x0

And now in docker node ls I see:

ID                           HOSTNAME                  STATUS  AVAILABILITY  MANAGER STATUS
50qxe43jjlwltuwzgem8cwwsa *  dmgr-02                   Ready   Active        Leader
hzykzxzh9jmn5mimj02awr42i    dmgr-01                   Ready   Active        Reachable
wiczce0ey8wzr9zaivlyfgm0f    dmgr-03                   Ready   Active        Reachable

So dmgr-01 stopped working and lost leadership in the swarm. My monitoring also shows that open ports flapped on that node.
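
For anyone hitting the same trace: "pthread_create failed: Resource temporarily unavailable" is EAGAIN, which usually points at either a thread/process limit or too little memory to start a new thread. Below is a minimal sketch for dumping the limits involved. It assumes a Linux host with procfs; the helper names are made up for illustration, and RLIMIT_NPROC is reported for the shell running the script, which may differ from the limit applied to the docker daemon.

```python
import resource
from pathlib import Path


def read_int(path):
    """Return the integer stored in a procfs file, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def thread_limits():
    """Collect limits that can make pthread_create fail with EAGAIN."""
    # Limits of the *current* process, not necessarily those of dockerd.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "rlimit_nproc_soft": soft,
        "rlimit_nproc_hard": hard,
        "threads_max": read_int("/proc/sys/kernel/threads-max"),
        "pid_max": read_int("/proc/sys/kernel/pid_max"),
    }


if __name__ == "__main__":
    for name, value in thread_limits().items():
        print(f"{name}: {value}")
```

If these limits are far above the number of threads actually running on the node, memory exhaustion becomes the more likely explanation.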

Additional information:

I looked through the logs on the problem node and found out-of-memory errors. Note, however, that all my manager nodes have 2 GB of RAM, and the containers running on them consume about 200-300 MB: there are only nginx and consul processes. In the normal state, free -m shows:

             total       used       free     shared    buffers     cached 
Mem:          2010        350       1659          0         19        173 
-/+ buffers/cache:        157       1852                                  
Swap:            0          0          0

My cluster has 10 nodes and about 70 docker services. So I think the out-of-memory errors can only be caused by a memory leak in the docker engine.
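
One way to back up the leak suspicion with numbers is to record the daemon's resident memory and thread count over time. The following is a rough sketch, not a definitive tool: it assumes Linux procfs, that the daemon PID is passed on the command line, and the function names are hypothetical.

```python
import sys
import time
from pathlib import Path


def parse_status(text):
    """Extract resident memory (kB) and thread count from /proc/<pid>/status text."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    # VmRSS looks like "123456 kB"; Threads is a bare integer.
    rss_kb = int(fields["VmRSS"].split()[0]) if "VmRSS" in fields else None
    threads = int(fields["Threads"]) if "Threads" in fields else None
    return rss_kb, threads


def sample(pid):
    """Read one memory/thread sample for the given PID."""
    return parse_status(Path(f"/proc/{pid}/status").read_text())


if __name__ == "__main__" and len(sys.argv) == 2:
    rss_kb, threads = sample(int(sys.argv[1]))
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    print(f"{stamp} rss_kb={rss_kb} threads={threads}")
```

Running it periodically (for example from cron, with the PID of dockerd as the argument) and appending the output to a file would show whether memory grows steadily or jumps at a particular moment.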

Output of docker version:

Client:                                      
 Version:      17.03.0-ce                    
 API version:  1.26                          
 Go version:   go1.7.5                       
 Git commit:   60ccb22                       
 Built:        Thu Feb 23 10:53:29 2017      
 OS/Arch:      linux/amd64                   
                                             
Server:                                      
 Version:      17.03.0-ce                    
 API version:  1.26 (minimum version 1.12)   
 Go version:   go1.7.5                       
 Git commit:   60ccb22                       
 Built:        Thu Feb 23 10:53:29 2017      
 OS/Arch:      linux/amd64                   
 Experimental: false

Output of docker info:

Containers: 8                  
 Running: 1                    
 Paused: 0                     
 Stopped: 7                    
Images: 10                     
Server Version: 17.03.0-ce     
Storage Driver: aufs                                             
 Root Dir: /var/lib/docker/aufs                                  
 Backing Filesystem: extfs                                       
 Dirs: 54                                                        
 Dirperm1 Supported: true                                        
Logging Driver: json-file                                        
Cgroup Driver: cgroupfs                                          
Plugins:                                                         
 Volume: local                                                   
 Network: bridge host macvlan null overlay                       
Swarm: active                                                    
 NodeID: hzykzxzh9jmn5mimj02awr42i                               
 Is Manager: true                                                
 ClusterID: hqvohft3etj4ajnkgubbnjwzp                            
 Managers: 3                                                     
 Nodes: 15                                                       
 Orchestration:                                                  
  Task History Retention Limit: 5                                
 Raft:                                                           
  Snapshot Interval: 10000                                       
  Number of Old Snapshots to Retain: 0                           
  Heartbeat Tick: 1                                              
  Election Tick: 3                                               
 Dispatcher:                                                     
  Heartbeat Period: 5 seconds                                    
 CA Configuration:                                               
  Expiry Duration: 3 months                                      
 Node Address: <IP of dmgr-01 here>                                 
 Manager Addresses:                                              
  <IP of dmgr-01 here>:2377                                           
  <IP of dmgr-02 here>:2377                                            
  <IP of dmgr-03 here>:2377                                            
Runtimes: runc                                                   
Default Runtime: runc                                            
Init Binary: docker-init                                         
containerd version: 977c511eda0925a723debdc94d09459af49d082a     
runc version: a01dafd48bc1c7cc12bdb01206f9fea7dd6feb70           
init version: 949e6fa                                            
Kernel Version: 3.16.0-4-amd64                                   
Operating System: Debian GNU/Linux 8 (jessie)                    
OSType: linux                                                    
Architecture: x86_64                                             
CPUs: 2                                                          
Total Memory: 1.963 GiB                                          
Name: dmgr-01                                                    
ID: VUNH:H6FP:CO4N:O6VB:CFCG:4T32:JQQV:SIS3:UGT7:V2FK:46VS:PRKZ  
Docker Root Dir: /var/lib/docker                                 
Debug Mode (client): false                                       
Debug Mode (server): false                                       
Username: filiatixbot                                            
Registry: https://index.docker.io/v1/                            
WARNING: No memory limit support                                 
WARNING: No swap limit support                                   
WARNING: No kernel memory limit support                          
WARNING: No oom kill disable support                             
WARNING: No cpu cfs quota support                                
WARNING: No cpu cfs period support                               
Experimental: false                                              
Insecure Registries:                                             
 127.0.0.0/8                                                     
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

I use a virtual server on DigitalOcean.

About this issue

State: closed
Created 7 years ago
Comments: 23 (12 by maintainers)

Most upvoted comments

Ok, so it looks like GELF was unable to send logs, possibly leading to it buffering messages until it’s able to send them.

ping @mariussturm @cpuguy83 any ideas / suggestions?

thaJeztah on Mar 20, 2017
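
To illustrate the failure mode suggested in the comment above: if a log sender keeps every message while the remote endpoint is unreachable, memory use grows without bound. The sketch below is not the Docker GELF driver's code; it is a hypothetical bounded buffer (the class name and the send callback are assumptions) that shows how capping the queue and dropping the oldest messages keeps memory flat during an outage.

```python
from collections import deque


class BoundedLogBuffer:
    """Hold log lines while the endpoint is unreachable, dropping the oldest when full."""

    def __init__(self, max_messages):
        self._queue = deque(maxlen=max_messages)
        self.dropped = 0

    def push(self, message):
        # When the deque is full, append() evicts the oldest entry; count that loss.
        if len(self._queue) == self._queue.maxlen:
            self.dropped += 1
        self._queue.append(message)

    def flush(self, send):
        """Send queued messages in order; stop at the first failure and keep the rest."""
        sent = 0
        while self._queue:
            if not send(self._queue[0]):
                break
            self._queue.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._queue)
```

Here send is any callable that returns True on success. The trade-off is deliberate: losing old log lines during a long outage is usually preferable to exhausting memory on a manager node.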