balena-engine: Containers drop off bridge networks unexpectedly
Description
In support we have recently seen cases where containers are unexpectedly removed from the bridge network.
Steps to reproduce the issue:
TBD
Describe the results you received:
balena inspect ${CONTAINER_ID} still shows the container attached to the proper network, but balena network inspect on that network shows that it no longer includes the container.
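For reference, a minimal sketch of how the two views can be compared (jq and the network name "bridge" are assumptions; substitute the affected container and network):

# Networks the container believes it is attached to:
balena inspect "${CONTAINER_ID}" | jq '.[0].NetworkSettings.Networks | keys'

# Containers the network believes are attached to it:
balena network inspect bridge | jq '.[0].Containers | map(.Name)'

On an affected device the first command lists the network while the second omits the container.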
Describe the results you expected:
The container remains attached to the bridge network, and the container-side and network-side inspect outputs agree.
Additional information you deem important (e.g. issue happens only occasionally):
The issue has happened frequently since upgrading from v2.58.4, but can be resolved by restarting the affected container.
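The workaround we have been applying is simply (sketch; ${CONTAINER_ID} is a placeholder for the affected container):

# Restarting the affected container reattaches it to the network:
balena restart "${CONTAINER_ID}"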
Output of balena-engine version:
Client:
 Version:           19.03.13-dev
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        074a481789174b4b6fd2d706086e8ffceb72e924
 Built:             Mon Feb 1 20:12:05 2021
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          19.03.13-dev
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       074a481789174b4b6fd2d706086e8ffceb72e924
  Built:            Mon Feb 1 20:12:05 2021
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          1.2.0+unknown
  GitCommit:
 runc:
  Version:
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 balena-engine-init:
  Version:          0.13.0
  GitCommit:        949e6fa-dirty
Output of balena-engine info:
Client:
 Debug Mode: false

Server:
 Containers: 12
  Running: 12
  Paused: 0
  Stopped: 0
 Images: 15
 Server Version: 19.03.13-dev
 Storage Driver: aufs
  Root Dir: /var/lib/docker/aufs
  Backing Filesystem: extfs
  Dirs: 331
  Dirperm1 Supported: true
 Logging Driver: journald
 Cgroup Driver: systemd
 Plugins:
  Volume: local
  Network: bridge host null
  Log: journald json-file local
 Swarm:
  NodeID:
  Is Manager: false
  Node Address:
 Runtimes: bare runc
 Default Runtime: runc
 Init Binary: balena-engine-init
 containerd version:
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: 949e6fa-dirty (expected: fec3683b971d9)
 Kernel Version: 5.8.18-yocto-standard
 Operating System: balenaOS 2.68.1+rev1
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.691GiB
 Name: 40623112549f.videolink.io
 ID: ODX3:BQOU:LFIR:MXUE:WZX4:L37E:BY42:VT3K:6AZL:FDZI:RDVT:UQ2B
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
Additional environment details (device type, OS, etc.):
ID="balena-os"
NAME="balenaOS"
VERSION="2.68.1+rev1"
VERSION_ID="2.68.1+rev1"
PRETTY_NAME="balenaOS 2.68.1+rev1"
MACHINE="genericx86-64"
VARIANT="Production"
VARIANT_ID=prod
META_BALENA_VERSION="2.68.1"
RESIN_BOARD_REV="cd52766"
META_RESIN_REV="e658a4e"
SLUG="intel-nuc"
About this issue
- State: open
- Created 3 years ago
- Comments: 17 (2 by maintainers)
I reviewed the 7 support tickets we have attached to this issue. We don't usually have all the data to check whether they were cases of Engine startup timeouts, but for 4 tickets (from two different users) there is strong evidence that this was indeed the case. Two other tickets (from two other users) were harder to analyze and probably involved more than a single issue, but we noticed unexpected behavior after a reboot (in one of them, it caught my attention that the application was running 11 containers, which could translate to a higher Engine startup time).
Still, very importantly, in one ticket we have good evidence that a container was dropped off the network without a reboot or Engine restart (the Engine had a 49-day uptime in that case).
So, startup timeouts seem to be a common cause of this issue, but not the only one.
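For anyone triaging similar tickets, a rough sketch of how the startup-timeout theory can be tested (assuming balenaOS, where the Engine runs as balena.service; jq and ${CONTAINER_ID} are placeholders/assumptions):

# Did systemd time out the Engine start on this boot?
journalctl -u balena.service | grep -i "timed out"

# When did the Engine unit last become active?
systemctl show balena.service -p ActiveEnterTimestamp

# When was the affected container started?
balena inspect "${CONTAINER_ID}" | jq -r '.[0].State.StartedAt'

An Engine start time later than the container's StartedAt suggests the Engine was restarted underneath the workload, consistent with the startup-timeout cases above; a long shared uptime, as in the 49-day ticket, points at a different cause.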