colima: Network in containers breaks under bigger network load

Network breaks in containers when they start multiple network connections at the same time.

I noticed this behaviour, for example, while downloading Python dependencies. When multiple packages are downloaded at the same time, I start getting "Network is unreachable" errors. When I then log in to the underlying QEMU machine (limactl shell colima), I can see that it can’t reach any network address; I cannot even ping 8.8.8.8. My host computer doesn’t have any connection issues.

It gets better after a few moments of inactivity. Restarting the QEMU machine (colima stop && colima start) fixes the network, but the problem comes back when I increase the network load.

I can reproduce this problem consistently. I created a minimal setup to demonstrate it: https://github.com/mjkonarski-b/colima-poc
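
The load pattern described above (many connections opened at once) can be approximated with a small shell helper. This is only a sketch and not the contents of the linked repository: run_parallel is a hypothetical helper name, and the URL in the usage comment is an arbitrary example.

```shell
#!/bin/sh
# run_parallel COUNT CMD [ARGS...]
# Starts COUNT copies of CMD at the same time, waits for all of them,
# and prints how many exited with a non-zero status.
run_parallel() {
  count=$1
  shift
  pids=""
  i=0
  while [ "$i" -lt "$count" ]; do
    "$@" >/dev/null 2>&1 &
    pids="$pids $!"
    i=$((i + 1))
  done
  failed=0
  for pid in $pids; do
    wait "$pid" || failed=$((failed + 1))
  done
  echo "$failed"
}

# Example usage (assumption: curl is available inside the container or VM):
#   run_parallel 20 curl -fsS -o /dev/null https://pypi.org/simple/
#   ping -c 1 8.8.8.8
```

A non-zero count followed by a failing ping would match the symptoms reported here.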

I experience this problem on multiple MacBooks, so it doesn’t seem to be related to any particular processor or macOS version:

  • MBP 2021 M1 Pro with 12.1 Monterey
  • MBP 2019 i7 with 12.1 Monterey
  • MBP 2019 i7 with 11.5.2 Big Sur

$ colima version
colima version 0.3.2
git commit: 272db4732b90390232ed9bdba955877f46a50552

runtime: docker
arch: aarch64
client: v20.10.11
server: v20.10.11


$ limactl --version
limactl version 0.8.1


$ qemu-img --version
qemu-img version 6.2.0
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 21
  • Comments: 47 (18 by maintainers)

Most upvoted comments

I was having this issue and was able to work around it by adding the following to ~/.lima/_config/override.yaml:

useHostResolver: false
dns:
- 8.8.8.8

Just an FYI that I have made notable progress with this.

Going with PTP-based networking (thanks @elventear) reduced the required dependencies to just vde_vmnet. It then turned out to be easy to bundle with Colima due to its small size.

In addition to fixing this issue (hopefully finally), all VMs also get IP addresses that are reachable from the host, which then fixes https://github.com/abiosoft/colima/issues/189, https://github.com/abiosoft/colima/issues/97, https://github.com/abiosoft/colima/issues/71 and provides a workaround for https://github.com/abiosoft/colima/issues/135.

Hi, same issue here on 5 different MBP machines.

When pulling multiple images at the same time with docker-compose, the network breaks and I get an unreachable error or an i/o timeout.

It would be great if the problem could be addressed soon.

$ colima version
colima version 0.3.2
git commit: 272db4732b90390232ed9bdba955877f46a50552

runtime: docker
arch: x86_64
client: v20.10.12
server: v20.10.11

$ limactl --version
limactl version 0.8.1

$ qemu-img --version
qemu-img version 6.2.0
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers

I did more investigation, but I couldn’t find the root cause. So far it seems that the problem lies in Lima or QEMU itself. I could reproduce it on machines running raw Lima images, without Colima. I found two issues in the Lima repo that seem to describe the very same problem: https://github.com/lima-vm/lima/issues/537 and https://github.com/lima-vm/lima/issues/561

@jasoncodes launchd is used mainly to keep it running as a background process. I can borrow from the approach used by Lima or find a way to tie it to the qemu process.

Thanks, your feedback has been helpful.

Yes, I’m more than happy to test any development branches you may have. Looking forward to having a release with built-in support for VDE networking. Thanks for your great work. 😃

Aside: Is there a documented uninstall process anywhere? Prior to this a colima delete on all profiles (followed by a brew uninstall) would clean everything up. Now we also have /opt/colima which is not automatically removed. Might be worth adding something to the README?

I just gave this a go with HEAD-5e2e413 and initially got the following output during colima start:

INFO[0000] preparing network ...                         context=vm
WARN[0015] error starting network: error at 'preparing network': stat /Users/jason/.colima/network/vmnet.ptp: no such file or directory  context=vm

~/.colima/network/vmnet.stderr contained the following:

sudo: a terminal is required to read the password; either use the -S option to read from standard input or configure an askpass helper
sudo: a password is required

After reviewing the generated ~/Library/LaunchAgents/com.abiosoft.colima.colima.plist file, I created /etc/sudoers.d/colima with the following:

%admin ALL=(ALL) NOPASSWD: /opt/colima/bin/colima-vmnet start colima

colima start now runs cleanly. lima0 is set up as 192.168.106.2 and is the default IPv4 route. Outbound TCP and ICMP are working well.

Edit: See https://github.com/abiosoft/colima/issues/140#issuecomment-1073375002. I had a custom /etc/sudoers.d/colima. Removing this file fixes things.


DNS is still using the user mode network which I have found to be unreliable with some DNS-heavy loads, even when all other traffic is routing via lima0. I’m using the following ~/.lima/_config/override.yaml to use lima0 for DNS:

useHostResolver: false
dns:
  - 192.168.106.1
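
For anyone who wants to script this step, a small helper can write the file. This is a minimal sketch under stated assumptions: write_dns_override is a hypothetical helper name, and it overwrites any existing override.yaml in the target directory, so merge by hand if you already have one.

```shell
#!/bin/sh
# write_dns_override DIR SERVER
# Writes DIR/override.yaml so that Lima stops using the host resolver
# and sends DNS queries to SERVER instead.
# WARNING: replaces any existing override.yaml in DIR.
write_dns_override() {
  dir=$1
  server=$2
  mkdir -p "$dir"
  cat > "$dir/override.yaml" <<EOF
useHostResolver: false
dns:
  - $server
EOF
}

# Example usage (paths and address taken from the comment above):
#   write_dns_override "$HOME/.lima/_config" 192.168.106.1
#   colima stop && colima start
```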

With a couple more optional tweaks I can also get direct IP access to containers from the host:

sudo route -n add -net 172.17.0.0/16 192.168.106.2
colima ssh -- sudo iptables -A FORWARD -i lima0 -j ACCEPT

The following in /etc/docker/daemon.json (along with sudo rc-service docker restart) ensures Docker Compose networks use 172.17.0.0/16 too, avoiding having to add additional host routes for these Docker networks:

  "default-address-pools": [
    {
      "base": "172.17.0.0/16",
      "size": 24
    }
  ]

@abiosoft I have noticed that after running for a while, eth0 got added back as a default route, I assume due to some network or power event. I am thinking a solution is to prevent eth0 from being considered for a default route. It seems the right way to do this in Alpine is:

echo 'NO_GATEWAY="eth0"' >> /etc/udhcpc/udhcpc.conf

Currently testing this.
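
One way to check whether eth0 has crept back in is to list which interfaces currently hold a default route. A minimal sketch, assuming the usual `default via <gateway> dev <iface>` output format of `ip route`; default_ifaces is a hypothetical helper name.

```shell
#!/bin/sh
# default_ifaces
# Reads `ip route` output on stdin and prints the interface name of
# every default route, one per line, in the order they appear.
default_ifaces() {
  awk '$1 == "default" {
    for (i = 2; i < NF; i++)
      if ($i == "dev") print $(i + 1)
  }'
}

# Example usage from the host:
#   colima ssh -- ip route | default_ifaces
```

If eth0 shows up in the output alongside lima0, the SLIRP default route is back.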

I dug deeper into this issue and have been able to work around it within Lima using PTP-based networking, as reported in lima-vm/lima#724. It would be nice to be able to make this all work seamlessly without manually managing the colima template or the vde_vmnet process.

With this workaround, I was able to get the desired result with this test https://github.com/abiosoft/colima/issues/140#issuecomment-1025634300.

I will keep an eye on the upstream issue. And in the meantime I will look at implementing this workaround in Colima.

A partial solution is to add the following to ~/.lima/_config/override.yaml:

---
networks:
   - vnl: "/tmp/vde.ptp"
     switchPort: 65535

This injects the PTP network into the colima image without changing the template, but it requires manually starting the vde_vmnet process and deleting the default route that goes through the SLIRP network.

I think it is more related to https://github.com/lima-vm/lima/issues/561. It is specific to macOS and not reproducible on Linux, which makes me think it has something to do with macOS networking.