weave: Weave Net addon causing kernel panics on Raspberry Pi 3B+
What you expected to happen?
What I expected to happen: once the two weave pods discovered each other, Weave would establish a connection and start working.
What happened?
The weave pod seems to do something that causes one of the two connecting nodes to crash with a kernel panic. I suspect weave itself isn't the root cause, but this seemed like a good place to start.
Logs from the kernel are at the end of this issue.
How to reproduce it?
Set up Kubernetes 1.9.7 on two nodes, and apply the Kubernetes Weave Net addon using
$ kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
- this appears to install version 2.3.0, judging by the pod descriptions from the API (a quick way to confirm is shown after the steps below).
Wait a short amount of time for the pods to try and connect to each other.
Notice that one of the machines has rebooted, and the other is unable to connect to the first (as it has crashed).
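For reference, the installed version can be confirmed from the pod spec, e.g. (this assumes the standard weave-net DaemonSet name and the name=weave-net label used by the addon manifest):
$ kubectl -n kube-system get ds weave-net -o jsonpath='{.spec.template.spec.containers[*].image}'
$ kubectl -n kube-system get pods -l name=weave-net -o wide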
Anything else we need to know?
Probably the most important part: both nodes are Raspberry Pi 3B+ machines running the latest version of Raspbian. They sit on a home network with statically assigned IPs 192.168.0.3-5. They are configured with Ansible to an extent, and I may be able to share the scripts used if needed.
Versions:
$ weave version (found by exec-ing into a running pod awaiting connections)
/home/weave # ./weave --local version
weave 2.3.0
$ docker version
Docker version 18.04.0-ce, build 3d479c0
$ uname -a
Linux m1 4.14.34-v7+ #1110 SMP Mon Apr 16 15:18:51 BST 2018 armv7l GNU/Linux
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T10:09:24Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.7", GitCommit:"dd5e1a2978fd0b97d9b78e1564398aeea7e7fe92", GitTreeState:"clean", BuildDate:"2018-04-18T23:58:35Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/arm"}
Logs:
Before one node connects to the other, everything looks mostly fine. The initial connections are attempted to each of the three peers - .3, .4, and .5.
$ kubectl logs -n kube-system <weave-net-pod> weave
DEBU: 2018/06/11 05:43:56.878318 [kube-peers] Checking peer "ee:bf:2b:a9:06:ad" against list &{[{42:fc:dc:59:ea:96 m1} {3e:39:75:92:f1:9b m3} {02:f9:0c:1b:52:04 m3} {ee:bf:2b:a9:06:ad m1}]}
INFO: 2018/06/11 05:43:56.993279 Command line options: map[ipalloc-init:consensus=3 ipalloc-range:10.32.0.0/12 name:ee:bf:2b:a9:06:ad nickname:m1 datapath:datapath db-prefix:/weavedb/weave-net docker-api: host-root:/host http-addr:127.0.0.1:6784 metrics-addr:0.0.0.0:6782 no-dns:true port:6783 conn-limit:100 expect-npc:true]
INFO: 2018/06/11 05:43:56.993509 weave 2.3.0
INFO: 2018/06/11 05:43:57.589147 Bridge type is bridged_fastdp
INFO: 2018/06/11 05:43:57.589233 Communication between peers is unencrypted.
INFO: 2018/06/11 05:43:57.625030 Our name is ee:bf:2b:a9:06:ad(m1)
INFO: 2018/06/11 05:43:57.625248 Launch detected - using supplied peer list: [192.168.0.3 192.168.0.4 192.168.0.5]
INFO: 2018/06/11 05:43:57.628632 Checking for pre-existing addresses on weave bridge
INFO: 2018/06/11 05:43:57.645824 [allocator ee:bf:2b:a9:06:ad] Initialising with persisted data
INFO: 2018/06/11 05:43:57.651091 Sniffing traffic on datapath (via ODP)
INFO: 2018/06/11 05:43:57.662415 ->[192.168.0.3:6783] attempting connection
INFO: 2018/06/11 05:43:57.674431 ->[192.168.0.4:6783] attempting connection
INFO: 2018/06/11 05:43:57.674875 ->[192.168.0.5:6783] attempting connection
INFO: 2018/06/11 05:43:57.675066 ->[192.168.0.3:57939] connection accepted
INFO: 2018/06/11 05:43:57.675843 ->[192.168.0.4:6783] error during connection attempt: dial tcp4 :0->192.168.0.4:6783: connect: connection refused
INFO: 2018/06/11 05:43:57.676161 ->[192.168.0.5:6783] error during connection attempt: dial tcp4 :0->192.168.0.5:6783: connect: connection refused
INFO: 2018/06/11 05:43:57.679452 ->[192.168.0.3:57939|ee:bf:2b:a9:06:ad(m1)]: connection shutting down due to error: cannot connect to ourself
INFO: 2018/06/11 05:43:57.680866 ->[192.168.0.3:6783|ee:bf:2b:a9:06:ad(m1)]: connection shutting down due to error: cannot connect to ourself
INFO: 2018/06/11 05:43:57.699266 Listening for HTTP control messages on 127.0.0.1:6784
INFO: 2018/06/11 05:43:57.700143 Listening for metrics requests on 0.0.0.0:6782
INFO: 2018/06/11 05:43:58.690696 [kube-peers] Added myself to peer list &{[{42:fc:dc:59:ea:96 m1} {3e:39:75:92:f1:9b m3} {02:f9:0c:1b:52:04 m3} {ee:bf:2b:a9:06:ad m1}]}
DEBU: 2018/06/11 05:43:58.703575 [kube-peers] Nodes that have disappeared: map[]
INFO: 2018/06/11 05:43:59.078850 ->[192.168.0.4:6783] attempting connection
INFO: 2018/06/11 05:43:59.080194 ->[192.168.0.4:6783] error during connection attempt: dial tcp4 :0->192.168.0.4:6783: connect: connection refused
INFO: 2018/06/11 05:43:59.315638 ->[192.168.0.5:6783] attempting connection
INFO: 2018/06/11 05:43:59.316856 ->[192.168.0.5:6783] error during connection attempt: dial tcp4 :0->192.168.0.5:6783: connect: connection refused
INFO: 2018/06/11 05:44:02.806629 ->[192.168.0.5:6783] attempting connection
INFO: 2018/06/11 05:44:02.808304 ->[192.168.0.5:6783] error during connection attempt: dial tcp4 :0->192.168.0.5:6783: connect: connection refused
INFO: 2018/06/11 05:44:03.554820 ->[192.168.0.4:6783] attempting connection
Once one pod connects to another, it's random which node crashes (but one always does). These are all the kernel-panic logs I could get hold of:
Jun 11 05:28:31 m1 kubelet[785]: I0611 05:28:31.922238 785 kubelet.go:2118] Container runtime status: Runtime Conditions: RuntimeReady=true reason: message:, NetworkReady=true reason: message:
Jun 11 05:28:32 m1 kernel: [ 2162.741648] Unable to handle kernel NULL pointer dereference at virtual address 00000000
Jun 11 05:28:32 m1 kernel: [ 2162.744896] pgd = 921c0000
Jun 11 05:28:32 m1 kernel: [ 2162.747989] [00000000] *pgd=19c2c835, *pte=00000000, *ppte=00000000
Jun 11 05:28:32 m1 kernel: [ 2162.751060] Internal error: Oops: 80000007 [#1] SMP ARM
Jun 11 05:28:32 m1 kernel: [ 2162.754243] Modules linked in: xt_NFLOG veth dummy vport_vxlan vxlan ip6_udp_tunnel udp_tunnel openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_defrag_ipv6 nfnetlink_log xt_statistic xt_nat xt_recent ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_set_hash_ip xt_set ip_set xt_comment xt_mark ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay cmac bnep hci_uart btbcm serdev bluetooth ecdh_generic evdev joydev sg brcmfmac brcmutil cfg80211 rfkill snd_bcm2835(C) snd_pcm snd_timer snd uio_pdrv_genirq fixed uio ip_tables x_tables ipv6
Message from syslogd@m1 at Jun 11 05:28:32 ...
kernel:[ 2162.816788] Process weaver (pid: 1896, stack limit = 0x92190210)
kernel:[ 2162.820871] Stack: (0x921919f0 to 0x92192000)
kernel:[ 2162.824936] 19e0: 00000000 00000000 0500a8c0 92191a88
kernel:[ 2162.829076] 1a00: 0000801a 00008bad b88a5bd0 b88a5b98 92191d2c 7f637ad0 00000001 92191a5c
kernel:[ 2162.833081] 1a20: 23d23b00 00000000 b88a5bd0 99ed4000 00000050 b406e000 00000000 99ed4050
kernel:[ 2162.837243] 1a40: 00000000 00008bad 00000040 0000801a 92191a68 00002100 00000000 00000000
kernel:[ 2162.841436] 1a60: 00008000 0000ee47 00000002 0500a8c0 00000000 00000000 00000000 00000000
kernel:[ 2162.845629] 1a80: 00000000 00000000 0300a8c0 00000000 00000000 00000000 00000000 00000000
Shared connection to m1 closed.
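(Side note: capturing the rest of an oops like this generally requires getting the messages off the machine before it locks up; netconsole is one option. The following is only a sketch - the interface name, ports, and receiving host are assumptions:)
$ sudo modprobe netconsole netconsole=6665@192.168.0.3/eth0,6666@192.168.0.4/
$ # on the receiving host (192.168.0.4), listen for the UDP stream:
$ nc -u -l 6666   # openbsd netcat syntax; traditional netcat needs -p 6666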
About this issue
- State: closed
- Created 6 years ago
- Comments: 25 (13 by maintainers)
Just checked, and indeed, the latest Raspbian is prone to the kernel bug mentioned in my comment above, as https://github.com/raspberrypi/linux/tree/raspberrypi-kernel_1.20180417-1 is missing the fix: https://github.com/torvalds/linux/commit/f15ca723c1ebe6c1a06bc95fda6b62cd87b44559#diff-4f541554c5f8f378effc907c8f0c9115.
As a workaround, you can disable fastdp by re-deploying your cluster with
$ kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')&env.WEAVE_NO_FASTDP=1"
You might need to remove the "weave" interface if it exists before the re-deploy.
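If the old bridge is left behind, it can be removed with standard iproute2 commands before re-applying (a sketch; "weave" is the default bridge name the addon creates):
$ sudo ip link show weave     # check whether the old bridge exists
$ sudo ip link delete weave   # remove it before the re-deploy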
Martynas,
I updated my Raspberry Pis to use the latest kernel by running rpi-update (4.14.52-v7+ in my case), adjusted my weave DaemonSet to re-enable fastdp, and restarted all my nodes. Everything works fine - weave-kube reports it's using bridged_fastdp and none of my nodes crash.

One more confirmation: I upgraded my Raspbian with a standard "apt upgrade", which upgraded my kernel. Everything now works as expected.
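For anyone following the rpi-update route, the steps were roughly as follows (a sketch; the weave-net DaemonSet name and the WEAVE_NO_FASTDP variable come from the standard addon manifest, so adjust to your setup):
$ sudo rpi-update                            # pull the newer Raspberry Pi kernel/firmware
$ sudo reboot
$ uname -r                                   # should now show 4.14.52-v7+ or later
$ kubectl -n kube-system edit ds weave-net   # remove the WEAVE_NO_FASTDP env var to re-enable fastdp
$ kubectl logs -n kube-system <weave-net-pod> weave | grep "Bridge type"   # expect bridged_fastdp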
@arnulfojr Thanks.
The non-fastdp mode, also known as sleeve, is slower and consumes more CPU cycles. Please see this post for more details: https://www.weave.works/blog/weave-docker-networking-performance-fast-data-path/

Agreed. We are going to update our docs to include known issues.
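If you want to confirm which mode your connections actually ended up using, the weave status output reports it per connection (sketched here using the same exec-into-the-pod approach as above; exact output wording may vary by version):
$ kubectl exec -n kube-system <weave-net-pod> -c weave -- ./weave --local status connections
# each established connection is listed as either fastdp or sleeve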