moby: IPv6 address pool subnet smaller than /80 causes dockerd to consume all available RAM

Description

It is documented that the IPv6 pool “should” be at least /80 so that the container’s MAC address can fit in the last 48 bits.

Using a default-address-pools size larger than 80 causes dockerd to consume excessive amounts of RAM; the longer the prefix (i.e. the smaller each subnet), the more RAM dockerd uses:

  • In the /81 - /90 range the increase in RAM usage is comparatively small, on the order of a few GB.
  • In the /94 - /96 range RAM usage reaches tens to hundreds of GB.

The pool prefix length could be set to a value larger than 80 by a typo or other mistake, and if that leads to dockerd consuming copious amounts of RAM, the administrator may well lose time troubleshooting the situation.

It looks like a prefix length such as /96 is effectively unusable, and dockerd should refuse to start instead of starting to allocate ridiculous amounts of RAM.

At minimum a warning message should be printed.
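
To make the scale concrete, here is a minimal sketch in Go (not dockerd’s actual code; the function name and threshold are invented for illustration) of the kind of sanity check suggested above: count how many sub-pools a default-address-pools entry would generate and refuse to start, or at least warn, when the count is clearly unusable. A /64 base carved into /96 pieces yields 2^32 sub-pools, which is presumably why memory consumption explodes.

  package main

  import "fmt"

  // Hypothetical check, not actual dockerd code: how many sub-pools would
  // a single default-address-pools entry produce?
  func subPoolCount(baseLen, size uint) uint64 {
      return uint64(1) << (size - baseLen) // 2^(size-baseLen)
  }

  func main() {
      // base 2001:db8:1:1f00::/64 split into /96 sub-pools:
      n := subPoolCount(64, 96)
      fmt.Println("sub-pools:", n) // 4294967296, i.e. 2^32

      const limit = 1 << 24 // arbitrary cut-off chosen for this sketch
      if n > limit {
          fmt.Println("refusing to start: pool would generate", n, "subnets")
      }
  }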

Steps to reproduce the issue:

  1. Set IPv6 address pool prefix length longer than 80:
  "default-address-pools": [
    { "base": "192.0.2.0/16", "size": 24 },
    { "base": "2001:db8:1:1f00::/64", "size": 96 }
  ],
  2. Start Docker.
  3. Watch the server grind to a halt and the kernel OOM killer being invoked.

Describe the results you received: dockerd consumes very large amounts of RAM (tens of GB).

Describe the results you expected: Either IPv6 pool prefix lengths longer than 80 should work, or dockerd should refuse to start with a configuration that cannot be used.

At minimum a warning message should be printed for prefix lengths longer than 80.

The documentation does not mention the RAM usage effect either:

The subnet for Docker containers should at least have a size of /80, so that an IPv6 address can end with the container’s MAC address and you prevent NDP neighbor cache invalidation issues in the Docker layer.
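
For illustration, here is a minimal Go sketch (again not Docker’s code, just the arithmetic the quoted documentation relies on): with an /80 subnet the remaining 128 - 80 = 48 bits are exactly the size of a MAC address, so the container’s MAC can be copied into the tail of the IPv6 address. The prefix and MAC below are placeholder values.

  package main

  import (
      "fmt"
      "net"
  )

  // Build an IPv6 address whose last 48 bits are the given MAC.
  // This only fits when the subnet prefix is /80 or shorter.
  func addrFromMAC(prefix *net.IPNet, mac net.HardwareAddr) net.IP {
      ip := make(net.IP, net.IPv6len)
      copy(ip, prefix.IP.To16())
      copy(ip[10:], mac) // bytes 10..15 hold the last 48 bits
      return ip
  }

  func main() {
      _, prefix, _ := net.ParseCIDR("2001:db8:1:1f00::/80")
      mac, _ := net.ParseMAC("02:42:ac:11:00:02")
      fmt.Println(addrFromMAC(prefix, mac)) // 2001:db8:1:1f00:0:242:ac11:2
  }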

Additional information you deem important (e.g. issue happens only occasionally): 100% reproducible.

Output of docker version:

Docker version 19.03.5, build 633a0ea

Output of docker info:

Client:
 Debug Mode: false

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 19.03.5
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: fluentd
 Cgroup Driver: systemd
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: b34a5c8af56e510852c35414db4c1f4fa6172339
 runc version: 3e425f80a8c931f88e6d94a8c831b9d5aa481657
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
  selinux
 Kernel Version: 3.10.0-1062.4.3.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 24
 Total Memory: 47.15GiB
 Name: docker.domain
 ID: xxx
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

About this issue

  • Original URL
  • State: open
  • Created 5 years ago
  • Reactions: 5
  • Comments: 21 (7 by maintainers)

Most upvoted comments

If I take Swarm out of the equation, the example in the compose 3 reference only shows this.

I can’t advise on Swarm, and my experience with GUA networks had various gotchas that I didn’t find time to document better, but you may find these IPv6 with Docker docs I wrote helpful?

It shows how to set up with the Docker CLI or Docker Compose. The official Docker IPv6 docs were in worse shape until recently (May), when they received a big revision (I provided some review feedback). My unofficial docs might provide a helpful resource though 😅

You can definitely create an IPv6 network via the CLI and reference it via compose.yaml. My linked docs should mention that, IIRC (NOTE: the link is not entirely stable, as it’s waiting on a v13 release of the project, and the linked edge version will probably break in the future when the docs are moved around).
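
As a rough sketch of what that can look like (the network name and ULA subnet are placeholders picked for illustration, not values from those docs), create the network once via the CLI:

  docker network create --ipv6 --subnet fd00:cafe:1::/64 ip6net

and then reference it from compose.yaml as an external network:

  services:
    app:
      image: nginx
      networks:
        - ip6net

  networks:
    ip6net:
      external: true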


If you’re using IPv4 NAT (the default), IPv6 ULA works well: it provides IPv6 networking that behaves the same way between containers and with the default-enabled userland-proxy.

Benefits of IPv6 ULA over IPv6 GUA:

  • Containers aren’t going to like binding to a public IPv4 interface when another container has already bound the same public port, which AFAIK makes the benefits of IPv6 GUA less useful unless you don’t need to be publicly reachable via IPv4 (last I checked, you can’t opt out of IPv4 assignment to a container).
    • You could of course also use a reverse proxy, but I’m not sure why you’d assign each container its own public IPv6 address if a reverse proxy is in use for IPv4.
  • IPv6 ULA makes more sense as a preferred default for those who don’t know any better, since users less familiar with IPv6 tend to get confused by GUA publicly exposing their containers (or by containers not being reachable due to the firewall, while IPv4 / ULA bypass the firewall because Docker manages iptables rules directly).

AWS is delegating a prefix up to a /80 per instance: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-prefix-eni.html#ec2-prefix-basics

GCP is delegating a prefix of /96 per instance: https://cloud.google.com/compute/docs/ip-addresses/configure-ipv6-address#ipv6-assignment

There are other cloud providers that are offering even smaller sizes for their prefix delegations.

Docker should work with smaller allocations too.