balena-engine: Containers drop off bridge networks unexpectedly

Description

In support we have seen several recent cases where containers are unexpectedly dropped from the bridge network.

Steps to reproduce the issue:

TBD

Describe the results you received:

  • balena inspect ${CONTAINER_ID} still shows the container attached to the expected network.
  • balena network inspect on that network shows it no longer includes the container (see the comparison sketched below this list).
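
For reference, a minimal way to compare the two views from a shell on the device might look like the following; MY_CONTAINER and MY_NETWORK are placeholder names, and the --format templates assume balena-engine keeps Docker-compatible inspect output, which is an assumption rather than output captured from an affected device:

  # MY_CONTAINER / MY_NETWORK are placeholders; the --format templates assume
  # Docker-compatible inspect output from balena-engine.

  # What the container believes: which networks it is attached to
  balena inspect --format '{{json .NetworkSettings.Networks}}' MY_CONTAINER

  # What the network believes: which containers it currently holds
  balena network inspect --format '{{json .Containers}}' MY_NETWORK

  # In the failure mode described above, the first command still lists
  # MY_NETWORK while the second no longer lists MY_CONTAINER.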

Describe the results you expected:

The container remains attached to its bridge network until it is explicitly disconnected or removed.

Additional information you deem important (e.g. issue happens only occasionally):

The issue has happened frequently since upgrading from v2.58.4, but it can be resolved by restarting the affected container.
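
A minimal sketch of that workaround, with MY_CONTAINER and MY_NETWORK standing in for the affected container and bridge network; the plain restart is the remedy reported above, while the explicit disconnect/reconnect is only a hypothetical alternative if a full restart is undesirable:

  # Reported workaround: restarting the container re-attaches it to the network.
  balena restart MY_CONTAINER

  # Hypothetical alternative (not from this report): re-attach the endpoint
  # without a full restart by disconnecting and reconnecting the network.
  balena network disconnect MY_NETWORK MY_CONTAINER
  balena network connect MY_NETWORK MY_CONTAINER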

Output of balena-engine version:

Client:
 Version:           19.03.13-dev
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        074a481789174b4b6fd2d706086e8ffceb72e924
 Built:             Mon Feb  1 20:12:05 2021
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.13-dev
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       074a481789174b4b6fd2d706086e8ffceb72e924
  Built:            Mon Feb  1 20:12:05 2021
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          1.2.0+unknown
  GitCommit:        
 runc:
  Version:          
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 balena-engine-init:
  Version:          0.13.0
  GitCommit:        949e6fa-dirty

Output of balena-engine info:

Client:
 Debug Mode: false

Server:
 Containers: 12
  Running: 12
  Paused: 0
  Stopped: 0
 Images: 15
 Server Version: 19.03.13-dev
 Storage Driver: aufs
  Root Dir: /var/lib/docker/aufs
  Backing Filesystem: extfs
  Dirs: 331
  Dirperm1 Supported: true
 Logging Driver: journald
 Cgroup Driver: systemd
 Plugins:
  Volume: local
  Network: bridge host null
  Log: journald json-file local
 Swarm: 
  NodeID: 
  Is Manager: false
  Node Address: 
 Runtimes: bare runc
 Default Runtime: runc
 Init Binary: balena-engine-init
 containerd version: 
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: 949e6fa-dirty (expected: fec3683b971d9)
 Kernel Version: 5.8.18-yocto-standard
 Operating System: balenaOS 2.68.1+rev1
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.691GiB
 Name: 40623112549f.videolink.io
 ID: ODX3:BQOU:LFIR:MXUE:WZX4:L37E:BY42:VT3K:6AZL:FDZI:RDVT:UQ2B
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional environment details (device type, OS, etc.):

ID="balena-os"
NAME="balenaOS"
VERSION="2.68.1+rev1"
VERSION_ID="2.68.1+rev1"
PRETTY_NAME="balenaOS 2.68.1+rev1"
MACHINE="genericx86-64"
VARIANT="Production"
VARIANT_ID=prod
META_BALENA_VERSION="2.68.1"
RESIN_BOARD_REV="cd52766"
META_RESIN_REV="e658a4e"
SLUG="intel-nuc"

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 17 (2 by maintainers)

Most upvoted comments

I reviewed the 7 support tickets we have attached to this issue. We don’t usually have all the data to check whether they were cases of Engine startup timeouts, but for 4 tickets (from two different users) there is strong evidence this was indeed the case. Two other tickets (from two other users) were harder to analyze and probably involved more than a single issue, but in both we noticed unexpected behavior after a reboot (in one of them, it caught my attention that the application was running 11 containers, which could translate to a longer Engine startup time).

Still, very importantly, in one ticket we have good evidence that a container got dropped off the network without a reboot or Engine restart (the Engine had a 49-day uptime in that case).

So, startup timeouts seem to be a common cause of this issue, but not the only one.
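
For anyone triaging similar tickets, a rough way to correlate a drop with an Engine (re)start on a balenaOS device might look like the following; the balena.service unit name and the log filters are assumptions about the device setup, not commands taken from these tickets:

  # When the Engine last (re)started; helps rule startup timeouts in or out.
  # "balena.service" is assumed to be the Engine unit on this balenaOS release.
  systemctl show -p ActiveEnterTimestamp balena.service

  # Engine logs around the time the container dropped off the network
  # (replace the placeholder timestamps with the window of interest).
  journalctl -u balena.service --since "YYYY-MM-DD HH:MM" --until "YYYY-MM-DD HH:MM" \
    | grep -i -E 'timeout|network|sandbox'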