moby: "tls: bad record MAC" on Qemu Risc-V VM

Description

While trying to docker pull some images, Docker logs multiple local error: tls: bad record MAC errors into the Journal and keep retrying layers to pull the images. Eventually they get pulled.

...
Nov 24 13:32:51 qemuriscv dockerd[430]: time="2019-11-24T13:32:51.589142373Z" level=error msg="fatal task error" error="No such container: func_gateway.1.s97f0g29121agb0ew8tegqc95" module=node/agent/taskmanager node.id=xnmja6254jyia6kahqri8cune service.id=ofbzjkclec63bkzr47mnzzyei task.id=s97f0g29121agb0ew8tegqc95
Nov 24 13:32:51 qemuriscv dockerd[430]: time="2019-11-24T13:32:51.716350921Z" level=info msg="API listen on /run/docker.sock"
Nov 24 13:33:06 qemuriscv dockerd[430]: time="2019-11-24T13:33:06.884992784Z" level=info msg="Download failed, retrying (1/5): local error: tls: bad record MAC"
Nov 24 13:33:14 qemuriscv dockerd[430]: time="2019-11-24T13:33:14.156336788Z" level=info msg="Download failed, retrying (2/5): local error: tls: bad record MAC"
Nov 24 13:33:14 qemuriscv dockerd[430]: time="2019-11-24T13:33:14.177138551Z" level=info msg="Download failed, retrying (3/5): local error: tls: bad record MAC"
Nov 24 13:33:25 qemuriscv dockerd[430]: time="2019-11-24T13:33:25.986011763Z" level=info msg="Download failed, retrying 
...

Steps to reproduce the issue:

  1. Fetch Qemu Virtual machine from https://github.com/carlosedp/riscv-bringup/releases/download/v1.0/debian-riscv64-20181123.tar.bz2
  2. Unpack the VM and run with run_debian.sh script.
  3. Download Docker for Risc-V built from master on 22/11/2019 in the VM from https://github.com/carlosedp/riscv-bringup/releases/download/v1.0/docker-19.03.5-dev_riscv64.deb
  4. Install Docker with sudo apt install ./docker-19.03.5-dev_riscv64.deb
  5. Try pulling a Docker image like docker pull carlosedp/faas-gateway:riscv64
  6. Check journal with sudo journalctl -xef -u docker for error messages.

Describe the results you received:

Docker keeps printing local error: tls: bad record MAC and retrying some layers until eventually pulling the image.

Describe the results you expected:

Image get pulled with no errors.

Additional information you deem important (e.g. issue happens only occasionally):

I also tried disabling all checksumming from eth0 in the VM (as I experienced similar problems on ARM before) but no behavior change:

❯ sudo ethtool --offload eth0 rx off tx off
sudo ethtool -K eth0 rx off tx off
sudo ethtool -K eth0 gso off
sudo ethtool -K eth0 gro off
sudo ethtool -K eth0 tso off
Cannot change rx-checksumming
Cannot change tx-checksumming
Cannot change rx-checksumming
Cannot change tx-checksumming
Cannot change tcp-segmentation-offload
                                                                                                                                                       
❯ sudo ethtool -k eth0
Features for eth0:
rx-checksumming: off [fixed]
tx-checksumming: off
	tx-checksum-ipv4: off [fixed]
	tx-checksum-ip-generic: off [fixed]
	tx-checksum-ipv6: off [fixed]
	tx-checksum-fcoe-crc: off [fixed]
	tx-checksum-sctp: off [fixed]
scatter-gather: off
	tx-scatter-gather: off [fixed]
	tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
	tx-tcp-segmentation: off [fixed]
	tx-tcp-ecn-segmentation: off [fixed]
	tx-tcp-mangleid-segmentation: off [fixed]
	tx-tcp6-segmentation: off [fixed]
udp-fragmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on [fixed]
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]

Output of docker version:

❯ docker version
Client:
 Version:           unknown-version
 API version:       1.40
 Go version:        devel +1c193b20a6 in the future
 Git commit:        016a3232
 Built:             Fri Nov 22 16:22:26 2019
 OS/Arch:           linux/riscv64
 Experimental:      true

Server:
 Engine:
  Version:          dev
  API version:      1.41 (minimum version 1.12)
  Go version:       devel +1c193b20a6 in the future
  Git commit:       d1d5f64766-unsupported
  Built:            Fri Nov 22 18:12:30 2019
  OS/Arch:          linux/riscv64
  Experimental:     false
 containerd:
  Version:          v1.3.1
  GitCommit:        c7a4f874b3267c499484aae602d1257b12d69e40
 docker-init:
  Version:          0.18.0
  GitCommit:        3c53686

Output of docker info:

❯ docker info
Client:
 Debug Mode: false

Server:
 Containers: 28
  Running: 7
  Paused: 0
  Stopped: 21
 Images: 13
 Server Version: dev
 Storage Driver: overlay2
  Backing Filesystem: <unknown>
  Supports d_type: true
  Native Overlay Diff: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: xnmja6254jyia6kahqri8cune
  Is Manager: true
  ClusterID: z5l4k24sfd7z92uygtl1vcgma
  Managers: 1
  Nodes: 1
  Default Address Pool: 10.0.0.0/8
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 10.0.2.15
  Manager Addresses:
   10.0.2.15:2377
 Runtimes: crun runc
 Default Runtime: crun
 Init Binary: docker-init
 containerd version: c7a4f874b3267c499484aae602d1257b12d69e40
 runc version: N/A
 init version: 3c53686 (expected: fec3683)
 Kernel Version: 5.3.0-rc4-g6c2a8bd-dirty
 Operating System: Debian GNU/Linux bullseye/sid
 OSType: linux
 Architecture: riscv64
 CPUs: 6
 Total Memory: 3.856GiB
 Name: qemuriscv
 ID: LWAK:QM3O:5VSI:VHOX:YBL5:3N6T:KOVA:IX5U:PR2A:77EG:K3VU:J6YK
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

Qemu Virtual Machine running Debian RootFS and Kernel 5.3-rc4.

I’m using crun (https://github.com/giuseppe/crun) as runtime since Runc depends on CGO which is not available on Risc-V.

My Kernel is built with this config: https://gist.github.com/carlosedp/4100815296675d126334f8f6afba4e93

Similar error to https://github.com/moby/moby/issues/38746.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 30 (13 by maintainers)

Commits related to this issue

Most upvoted comments

I’m not sure if this is related but I saw something that matches this description a couple of months ago. It was really mysterious, I did some debugging but didn’t really get to anything useful.

To me, it was not risc-v related, just qemu with user-net on a mac. I saw some net corruption from processes like git and docker. When I tried to trace what was going on it seemed that sometimes a packet was missing the last byte. Sounds unbelievable but not making this up. I could even repro it with nc in a very basic linux vm in the end iirc. When I traced the network traffic with qemu or tcpdump the missing byte was there. But it was not after syscall returned, or even in the strace/bpftrace output. I tried different versions of qemu as well. And I believe even different machines (cc @tiborvass)