podman: IPv4 Default Route Does Not Propagate to Pasta Containers on Hetzner VPSes

Issue Description

When using the pasta networking back-end in Podman on a Hetzner cloud VPS, the IPv4 routing table visible to the container is missing both the route to the host system’s configured gateway and the associated default route. As a result, the container cannot reach the internet via IPv4. Oddly enough, the IPv6 routing table appears to be complete, and IPv6 connectivity inside the container works without issue.

This issue does not occur on my home network: neither my home computer (Arch Linux) nor a Cortex-A53 development board (Fedora CoreOS) exhibits it. However, I have been able to reproduce it consistently on Hetzner VPSes, under both Arch Linux and Fedora CoreOS, as well as under Hetzner’s customised Fedora Server distribution.

There is also no such issue when using Podman’s default network… but then I don’t get the benefits of using pasta.

Steps to reproduce the issue

$ ip route
$ podman run -it --rm --network=pasta alpine sh
# ip route
# ping -c 1 8.8.8.8

Describe the results you received

The ip route call on the host system outputs the system’s routing table. On my current VPS, this is:

$ ip route
default via 172.31.1.1 dev ens3 proto static metric 100
<first 3 octets of the VPS' public IPv4 address>/24 dev ens3 proto kernel scope link src <VPS' IPv4 address> metric 100
172.31.1.1 dev ens3 proto static scope link metric 100

The ip route call inside the container outputs a table without entries for the gateway and default route, i.e.

# ip route
<first 3 octets of the VPS' public IPv4 address>/24 dev ens3 proto kernel scope link src <VPS' IPv4 address> metric 100

As a result, attempts to reach the internet using IPv4 fail. IPv6 connectivity has no issues.

# ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
ping: sendto: Network unreachable
# ping -c 1 2404:e80::1337:af
PING 2404:e80::1337:af (2404:e80::1337:af): 56 data bytes
64 bytes from 2404:e80::1337:af: seq=0 ttl=255 time=257.822 ms

--- 2404:e80::1337:af ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 257.822/257.822/257.822 ms

NOTE: The <first 3 octets of the VPS' public IPv4 address> entry in the routing table is the result of an error in my static network configuration. I have left it in the output above because it demonstrates that parts of the routing table do get propagated to the container, just not the gateway and default routes. In a correct configuration, where that extraneous entry is not present in the host system’s routing table, the container’s routing table is simply empty.

$ ip route
default via 172.31.1.1 dev ens3 proto static metric 100
172.31.1.1 dev ens3 proto static scope link metric 100
$ podman run -it --rm --network=pasta alpine sh
# ip route
<no output>

Describe the results you expected

I expected a complete routing table to be available inside the container, as is the case on my home network.

$ ip route
default via 192.168.0.1 dev end0 proto dhcp src 192.168.0.11 metric 100 
192.168.0.0/24 dev end0 proto kernel scope link src 192.168.0.11 metric 100 
$ podman run -it --rm --network=pasta alpine sh
# ip route show
default via 192.168.0.1 dev end0 
192.168.0.0/24 dev end0 scope link  src 192.168.0.11
# ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=42 time=22.281 ms

--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 22.281/22.281/22.281 ms

podman info output

$ podman info
host:
  arch: amd64
  buildahVersion: 1.30.0
  cgroupControllers:
  - cpu
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: /usr/bin/conmon is owned by conmon 1:2.1.7-1
    path: /usr/bin/conmon
    version: 'conmon version 2.1.7, commit: f633919178f6c8ee4fb41b848a056ec33f8d707d'
  cpuUtilization:
    idlePercent: 99.54
    systemPercent: 0.21
    userPercent: 0.25
  cpus: 1
  databaseBackend: boltdb
  distribution:
    distribution: arch
    version: unknown
  eventLogger: journald
  hostname: neoninteger-test-server
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 6.3.1-arch2-1
  linkmode: dynamic
  logDriver: journald
  memFree: 1747542016
  memTotal: 2017873920
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: /usr/bin/crun is owned by crun 1.8.3-2
    path: /usr/bin/crun
    version: |-
      crun version 1.8.3
      commit: 59f2beb7efb0d35611d5818fd0311883676f6f7e
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /etc/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: /usr/bin/slirp4netns is owned by slirp4netns 1.2.0-1
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.4
  swapFree: 0
  swapTotal: 0
  uptime: 1h 3m 54.00s (Approximately 0.04 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries: {}
store:
  configFile: /home/neoninteger/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/neoninteger/.local/share/containers/storage
  graphRootAllocated: 19977711616
  graphRootUsed: 2542403584
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 1
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/neoninteger/.local/share/containers/storage/volumes
version:
  APIVersion: 4.5.0
  Built: 1681856273
  BuiltTime: Wed Apr 19 07:47:53 2023
  GitCommit: 75e3c12579d391b81d871fd1cded6cf0d043550a-dirty
  GoVersion: go1.20.3
  Os: linux
  OsArch: linux/amd64
  Version: 4.5.0

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

Yes

Additional environment details

Reproduced in the following environments:

  • Hetzner CPX21 running Fedora CoreOS (this server is now busy with something else and cannot be used for further testing)
  • Hetzner CX11 running Fedora Server with Hetzner customisations (e.g. networking configured using cloud-init)
  • Hetzner CX11 running Arch Linux (current deployment)

The CoreOS and Arch instances were custom OS deployments that did not use Hetzner’s cloud-init system. On these systems I tested both DHCP and manual IPv4 configuration; the issue occurs with both.

The issue does not occur on my home network, with either of the following devices:

  • MacBookPro11,1 (Arch Linux)
  • Hardkernel ODROID-C4 (Fedora CoreOS, aarch64 variant)

Additional information

It should be noted that the incomplete routing table appears to be the only problem here. If I manually add the missing routes from the host system, IPv4 connectivity within the container works.

$ podman run -it --rm --network=pasta --cap-add=NET_ADMIN alpine sh
# ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
ping: sendto: Network unreachable
# ip route add 172.31.1.1 dev ens3
# ip route add default via 172.31.1.1
# ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=42 time=7.861 ms

--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 7.861/7.861/7.861 ms

About this issue

  • State: closed
  • Created a year ago
  • Comments: 19 (6 by maintainers)

Most upvoted comments

Apologies for the delay. I just re-tested against the HEAD version (e3b1953) and as best I can tell, everything still works.

Thanks! And we finally have a version with the changes, 2023_06_03.429e1a7 (not yet in Arch Linux, in testing for Fedora 38).

@sbrivio-rh Did some basic testing on HEAD and seems to work. Thank you! 👍

$ CONTAINERS_HELPER_BINARY_DIR=. podman run -it --rm --network=pasta alpine sh
/ # ping -c3 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=42 time=9.303 ms
64 bytes from 8.8.8.8: seq=1 ttl=42 time=5.638 ms
64 bytes from 8.8.8.8: seq=2 ttl=42 time=6.336 ms

--- 8.8.8.8 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 5.638/7.092/9.303 ms
/ # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UNKNOWN qlen 1000
    link/ether 0a:f2:72:b7:17:21 brd ff:ff:ff:ff:ff:ff
    inet 162.55.38.218/32 scope global dynamic noprefixroute eth0
       valid_lft 85823sec preferred_lft 85823sec
    inet6 2a01:4f8:c17:8d14::1/64 scope global noprefixroute 
       valid_lft forever preferred_lft forever
    inet6 fe80::8f2:72ff:feb7:1721/64 scope link 
       valid_lft forever preferred_lft forever
/ # ip route
default via 172.31.1.1 dev eth0  src 162.55.38.218  metric 100 
172.31.1.1 dev eth0 scope link  src 162.55.38.218  metric 100 

I’ve just spent some time testing the patch series from the mailing list, together with the additional patch provided in the comments above, on both my home network and my VPS. I’ve tried models of all of the scenarios I envision using in the near future, including:

  • containers being able to reach the internet and other local network devices
  • other devices on the local network and internet being able to reach forwarded TCP and UDP ports
  • linking ports between containers on the same machine
  • all of the above on both IPv4 and IPv6

As far as I can tell, all of the pasta functionality I intend to use works.

I don’t know whether David’s thoughts will end up changing anything or not. If the patch series is revised further, I’ll stay subscribed to the mailing list and test any relevant new proposals as time allows.

Thank you for taking the time to work on this. pasta is proving to be an excellent solution for container networking and I look forward to continuing its use.

If any Podman maintainers wish to close this issue early, then please do so. If not, I’ll close it myself once a relevant patch series has been applied and is available from distribution package repositories.

Running Podman 4.5.0 and passt 96f8d55c4f5093fa59c168361c0428b53b6d2d06 with the patchset and the following additional patch applied (new server, but weirdly enough I got the same IP address again):

diff --git a/netlink.c b/netlink.c
index 70218cd..81f2415 100644
--- a/netlink.c
+++ b/netlink.c
@@ -319,6 +319,13 @@ next:
                nh = (struct nlmsghdr *)buf;
                nl_req(1, resp, nh, nlmsgs_size);
        }
+
+        if (op == NL_DUP) {
+                char resp[NLBUFSIZ];
+
+                nh = (struct nlmsghdr *)buf;
+                nl_req(1, resp, nh, nlmsgs_size);
+        }
 }
 
 /**
$ CONTAINERS_HELPER_BINARY_DIR=. podman run -it --rm --network=pasta alpine sh
/ # ping -c3 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=42 time=8.112 ms
64 bytes from 8.8.8.8: seq=1 ttl=42 time=5.616 ms
64 bytes from 8.8.8.8: seq=2 ttl=42 time=5.375 ms

--- 8.8.8.8 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 5.375/6.367/8.112 ms
/ # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UNKNOWN qlen 1000
    link/ether 4e:92:08:f9:51:6d brd ff:ff:ff:ff:ff:ff
    inet 49.13.6.54/32 scope global dynamic noprefixroute eth0
       valid_lft 85688sec preferred_lft 85688sec
    inet6 2a01:4f8:c010:820f::1/64 scope global noprefixroute 
       valid_lft forever preferred_lft forever
    inet6 fe80::4c92:8ff:fef9:516d/64 scope link 
       valid_lft forever preferred_lft forever
/ # ip route
default via 172.31.1.1 dev eth0  src 49.13.6.54  metric 100 
172.31.1.1 dev eth0 scope link  src 49.13.6.54  metric 100 
/ # 

Works! Thank you very much @sbrivio-rh ❤️

I just posted a series that should fix this by optionally copying all the routes (and addresses) associated with the selected interface on pasta --config-net (enabled by default).

It turned out to be simpler to do that than to try to figure out whether we can safely copy a single route or need to copy more than one.

I haven’t tried particularly hard to replicate the problematic setup, though, so testing and feedback would be very appreciated. Thanks!
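
For illustration, here is a minimal sketch of that dump-and-replay approach, assuming two NETLINK_ROUTE sockets: host_fd, opened on the host, and ns_fd, opened after entering the target namespace. Both names are mine, and this is not the actual passt implementation; it dumps the host’s IPv4 routes with RTM_GETROUTE and replays each one as an RTM_NEWROUTE request in the namespace.

#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Sketch only: dump the host's IPv4 routes and replay them inside the
 * namespace. A real implementation would also parse the attributes and
 * keep only routes whose RTA_OIF matches the selected interface.
 */
static int copy_routes(int host_fd, int ns_fd)
{
        struct {
                struct nlmsghdr nlh;
                struct rtmsg rtm;
        } req = { 0 };
        char buf[32768];
        ssize_t n;

        req.nlh.nlmsg_len = sizeof(req);
        req.nlh.nlmsg_type = RTM_GETROUTE;
        req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
        req.rtm.rtm_family = AF_INET;

        if (send(host_fd, &req, sizeof(req), 0) < 0)
                return -1;

        while ((n = recv(host_fd, buf, sizeof(buf), 0)) > 0) {
                struct nlmsghdr *nh = (struct nlmsghdr *)buf;

                for (; NLMSG_OK(nh, n); nh = NLMSG_NEXT(nh, n)) {
                        if (nh->nlmsg_type == NLMSG_DONE)
                                return 0;
                        if (nh->nlmsg_type != RTM_NEWROUTE)
                                continue;

                        /* Dumped routes and creation requests share the
                         * same layout: flip the flags and re-send. Routes
                         * that depend on one another (a default route via
                         * a gateway that itself needs an on-link route)
                         * may fail on the first pass; re-sending the
                         * batch, as the NL_DUP patch quoted earlier does,
                         * papers over the ordering.
                         */
                        nh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE;
                        send(ns_fd, nh, nh->nlmsg_len, 0);
                }
        }
        return 0;
}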

Since there is now credible evidence that this is not specifically a Podman issue, should further discussion be moved to Bugzilla, or is keeping it here okay?

I would keep it here to avoid unnecessary indirection.

2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 96:00:02:2d:59:8e brd ff:ff:ff:ff:ff:ff
    altname enp0s3
    inet 65.108.218.67/32 scope global dynamic noprefixroute ens3
       valid_lft 86347sec preferred_lft 86347sec
    inet6 2a01:4f9:c011:8389::1/64 scope global noprefixroute
       valid_lft forever preferred_lft forever
    inet6 fe80::9400:2ff:fe2d:598e/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
$ ip route
default via 172.31.1.1 dev ens3 proto dhcp src 65.108.218.67 metric 100
172.31.1.1 dev ens3 proto dhcp scope link src 65.108.218.67 metric 100

Groan.

Thank you for posting the link to the GCE patch, as the symptoms and cause do seem to be similar. The IP address provided by Hetzner’s DHCP system does indeed have a /32 sub-net mask, implying that the gateway cannot be on the same logical sub-net. In fact, the assigned IPv4 address is very different from the provided gateway address.

65.108.218.67 = 01000001 01101100 11011010 01000011
172.31.1.1    = 10101100 00011111 00000001 00000001

There is no possible sub-net mask under which these addresses match. So the algorithm in the patch you linked would run through all 32 iterations, shrinking the sub-net mask all the way down to /0, find that the shifted addresses still don’t match, and leave the original, broken /32 mask in place.
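
To make that concrete, here is a toy version of that check (my own illustration, not the code from the patch); compiling and running it prints /0 for these two addresses:

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

/* Toy check: shrink the prefix until the address and gateway land in the
 * same subnet, or give up at /0.
 */
static int common_prefix(uint32_t addr, uint32_t gw)
{
        int len;

        for (len = 32; len > 0; len--) {
                uint32_t mask = ~0U << (32 - len);

                if ((addr & mask) == (gw & mask))
                        return len;
        }
        return 0;       /* not even /1 matches */
}

int main(void)
{
        uint32_t addr = ntohl(inet_addr("65.108.218.67"));
        uint32_t gw = ntohl(inet_addr("172.31.1.1"));

        /* Prints /0: the very first bit already differs, so no subnet
         * mask can put the gateway on-link with this address.
         */
        printf("/%d\n", common_prefix(addr, gw));
        return 0;
}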

Right. I wonder why nowadays there seems to be an expectation for that kind of configuration to be in any way sane. In my opinion it spectacularly clashes with RFC 791 section 2.2:

The internet module prepares a datagram header and attaches the data to it. The internet module determines a local network address for this internet address, in this case it is the address of a gateway.

(emphasis mine). Whatever, we can’t just go and “fix” that in all the possible environments.

Unless there are other preferences, I would add a workaround that, at least for those cases (or whenever there’s a default route for our outbound interface without a gateway), configures an equivalent route in the detached namespace. In your case, that would be 172.31.1.1 dev ens3 scope link, just like dhclient sets it up. We need to modify nl_route() in netlink.c a bit to support gateway-less routes (but we’ll need that anyway to cover this case).
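
For reference, a gateway-less on-link route is an RTM_NEWROUTE request with rtm_scope RT_SCOPE_LINK and no RTA_GATEWAY attribute, i.e. what ip route add 172.31.1.1 dev ens3 sends. A rough sketch of the message layout (the struct and helper names here are made up for this illustration, not the actual nl_route() change):

#include <string.h>
#include <stdint.h>
#include <linux/rtnetlink.h>

struct link_route_req {
        struct nlmsghdr nlh;
        struct rtmsg rtm;
        struct rtattr rta_dst;
        uint32_t dst;   /* RTA_DST, e.g. 172.31.1.1 in network order */
        struct rtattr rta_oif;
        uint32_t oif;   /* RTA_OIF, index of the outbound interface */
};

static void fill_link_route(struct link_route_req *req,
                            uint32_t dst, uint32_t ifindex)
{
        memset(req, 0, sizeof(*req));

        req->nlh.nlmsg_len = sizeof(*req);
        req->nlh.nlmsg_type = RTM_NEWROUTE;
        req->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_CREATE;

        req->rtm.rtm_family = AF_INET;
        req->rtm.rtm_dst_len = 32;              /* host route (/32) */
        req->rtm.rtm_table = RT_TABLE_MAIN;
        req->rtm.rtm_protocol = RTPROT_BOOT;
        req->rtm.rtm_scope = RT_SCOPE_LINK;     /* on-link: no gateway */
        req->rtm.rtm_type = RTN_UNICAST;

        req->rta_dst.rta_type = RTA_DST;
        req->rta_dst.rta_len = RTA_LENGTH(sizeof(req->dst));
        req->dst = dst;

        req->rta_oif.rta_type = RTA_OIF;
        req->rta_oif.rta_len = RTA_LENGTH(sizeof(req->oif));
        req->oif = ifindex;
}

The corresponding default route would then be a second RTM_NEWROUTE with rtm_scope RT_SCOPE_UNIVERSE and an RTA_GATEWAY attribute pointing at 172.31.1.1.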

Feel free to send a patch, or I might get to it later today (presumably tomorrow for you).