restic: System-wide hang after a few weeks
Output of restic version
ianh@feral:~$ restic version
restic 0.11.0 compiled with go1.15.9 on linux/amd64
ianh@feral:~$ uname -a
Linux feral 5.10.0-20-amd64 #1 SMP Debian 5.10.158-2 (2022-12-13) x86_64 GNU/Linux
ianh@feral:~$ cat /proc/cpuinfo | grep model
model : 122
model name : Intel(R) Celeron(R) N4000 CPU @ 1.10GHz
model : 122
model name : Intel(R) Celeron(R) N4000 CPU @ 1.10GHz
ianh@feral:~$ cat /proc/meminfo | head -3
MemTotal: 3841828 kB
MemFree: 124884 kB
MemAvailable: 3083456 kB
ianh@feral:~$ df -hal
Filesystem Size Used Avail Use% Mounted on
sysfs 0 0 0 - /sys
proc 0 0 0 - /proc
udev 1.9G 0 1.9G 0% /dev
devpts 0 0 0 - /dev/pts
tmpfs 376M 816K 375M 1% /run
/dev/mmcblk0p2 55G 1.4G 51G 3% /
securityfs 0 0 0 - /sys/kernel/security
tmpfs 1.9G 0 1.9G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
cgroup2 0 0 0 - /sys/fs/cgroup
pstore 0 0 0 - /sys/fs/pstore
efivarfs 0 0 0 - /sys/firmware/efi/efivars
none 0 0 0 - /sys/fs/bpf
systemd-1 - - - - /proc/sys/fs/binfmt_misc
hugetlbfs 0 0 0 - /dev/hugepages
mqueue 0 0 0 - /dev/mqueue
debugfs 0 0 0 - /sys/kernel/debug
tracefs 0 0 0 - /sys/kernel/tracing
configfs 0 0 0 - /sys/kernel/config
fusectl 0 0 0 - /sys/fs/fuse/connections
/dev/mmcblk0p1 511M 5.8M 506M 2% /boot/efi
/dev/nvme0n1p1 1.8T 23G 1.7T 2% /home
tmpfs 376M 0 376M 0% /run/user/1000
binfmt_misc 0 0 0 - /proc/sys/fs/binfmt_misc
How did you run restic exactly?
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export RESTIC_REPOSITORY="s3:s3.us-west-000.backblazeb2.com/..."
export RESTIC_PASSWORD_FILE=/home/ianh/restic/password.key
restic init
# the output is lost to time, but matched the usual output seen in tutorials
restic backup /mnt/*
# the output is what I would expect (i.e. just listing files it is backing up), although occasionally I do see lines like:
Save(<data/...>) returned error, retrying after 619.645675ms: client.PutObject: An internal error occurred. Please retry your upload.
The files being backed up (/mnt/*) are CIFS mounts from a Drobo NAS. The Drobo and the NUC running restic have fixed IPs. Both are protected by separate UPSes. Both are connected to the network via gigabit Ethernet (no wifi).
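For context, the CIFS mounts are declared along these lines in /etc/fstab (the share names and credentials path shown here are placeholders, not my real ones):

# hypothetical /etc/fstab entries for the Drobo's CIFS shares
//drobo/share1  /mnt/share1  cifs  credentials=/root/.smbcreds,vers=3.0,_netdev  0  0
//drobo/share2  /mnt/share2  cifs  credentials=/root/.smbcreds,vers=3.0,_netdev  0  0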
What backend/server/service did you use to store the repository?
Backblaze via the s3 restic backend.
Expected behavior
System should back up successfully.
Actual behavior
I’ve done this twice so far, once on a headless raspberry pi, and once on a headless NUC (the details for the latter are above). In both cases they were running some variant of Linux, accessed over SSH.
In both cases, everything seemed to be going great for the first few days (I have terabytes of data to back up). However, in both cases, after a few weeks, I went to check on the computer and could not ssh into it. In the case of the raspberry pi, I was never able to figure out what happened. The microSD card was trashed, and I could never get anything out of it (e.g. logs) to determine what happened. The problem occurred during a heat wave, and I originally just assumed that the heat, combined with the low capabilities of the pi and the other software running on it at the time, was the problem.
More recently though I got a new NUC, freshly installed, no other tasks running on it at all. I set it up to do the backup as described above. I cleared the backblaze bucket first so it was a fresh backup too. At first everything seemed great, but when I checked up on it yesterday, the machine had hard hung. It was no longer doing DHCP or ARP (and its lease had expired). I connected a display and keyboard to the NUC to see what was happening; the keyboard did not respond (capslock did not toggle its LED, for example), and on the screen there was what I presume was a kernel crash. I regret that I did not think to take a photo, thinking it would be visible in the logs.
Unfortunately when I examined the logs I found nothing. Based on my DHCP logs, the machine disconnected from the network around 9am (I have short leases of around 15 minutes), but the latest entries in the log files all dated from hours earlier or were mundane entries like renewing DHCP. Whatever caused the problem prevented the last few seconds of logs from being written to disk; some of the log files ended with NULs (upon rebooting, there was a message about the filesystem having to replay the journal). Also I was not, at the time, logging restic output to disk.
Steps to reproduce the behavior
I don’t know. So far I have had a 100% success rate at reproducing this simply by using restic normally to back up several terabytes and then waiting a few weeks (each hang happened very roughly 50% of the way through the total backup), but obviously two data points do not make a trend, and in particular the first hang happened amid many confounding circumstances.
Do you have any idea what may have caused this?
No.
Do you have an idea how to solve the issue?
No.
Thoughts
I’m running restic again now with the log being teed to disk and with data from /proc/meminfo and /proc/pressure/* being logged every minute in case that shows a trend. I am filing this issue in part in the hope that you will recognise the symptoms as those of some misconfiguration error I’ve made, and in part in the hope that you will suggest other things I could add to my “log system state every minute” script so that we can debug the cause if it happens again.
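Concretely, the backup itself is now invoked along these lines so that restic's own output is preserved (the exact log path is arbitrary):

restic backup /mnt/* 2>&1 | tee -a /home/ianh/restic/backup.log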
FWIW, currently I’m running:
echo ""
date
grep MemFree /proc/meminfo
grep full /proc/pressure/*
…every minute and the first and latest log entries are:
Wed 04 Jan 2023 04:14:55 PM PST
MemFree: 112780 kB
/proc/pressure/io:full avg10=3.92 avg60=3.93 avg300=3.99 total=75934965
/proc/pressure/memory:full avg10=0.00 avg60=0.04 avg300=0.00 total=601574
[...]
Thu 05 Jan 2023 01:20:51 PM PST
MemFree: 115972 kB
/proc/pressure/io:full avg10=3.23 avg60=3.20 avg300=3.20 total=7854277124
/proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=73134839
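If it would help, I could extend the per-minute script to something like the following (a sketch; the extra probes are guesses at what might reveal a trend, not things I know to be relevant):

echo ""
date
grep MemFree /proc/meminfo
grep full /proc/pressure/*
# possible additions:
grep -E 'Slab|SUnreclaim' /proc/meminfo   # kernel memory, in case a module is leaking
cat /proc/loadavg                         # load and run-queue length
dmesg --level=err,warn | tail -n 5        # recent kernel warnings, if any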
I suppose the problem could be something like a bug in the CIFS kernel module where after some significant amount of network traffic, it crashes.
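Since whatever happened kept the final moments of the logs from ever reaching disk, one thing I may try is the kernel's netconsole module, which streams kernel messages (including a panic) to another machine over UDP. The addresses, ports, interface, and MAC below are placeholders:

# on the NUC: send kernel messages from 192.168.1.50 via eth0
# to a collector at 192.168.1.10 port 6666 (MAC is the collector's/gateway's)
modprobe netconsole netconsole=6665@192.168.1.50/eth0,6666@192.168.1.10/00:11:22:33:44:55
# on the collector:
nc -u -l 6666 | tee netconsole.log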
Did restic help you today? Did it make you happy in any way?
I love the potential of restic! I really hope this is just an aberration because the promise of easy backups is extremely attractive. So far all my interactions with y’all have been super positive.
About this issue
- State: closed
- Created a year ago
- Comments: 32 (6 by maintainers)
@AlBundy33 Restic uses an index that keeps information about each stored file chunk and is used to deduplicate data. For performance and simplicity, that index is loaded fully into memory. Since every file chunk is listed in the index, it grows as more data is added to the repository.
The file metadata inside a folder also requires additional memory, but unless you have a folder that directly contains hundreds of thousands of files (not counting subfolders), the main memory user will be the index.
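To get a rough feel for how large a given repository's index is, one option is to count the blobs it tracks (a sketch; the per-entry memory cost below is only a ballpark assumption, not an exact figure):

# count the chunks (blobs) that the in-memory index must hold;
# each entry costs very roughly tens of bytes of RAM (assumption)
restic list blobs | wc -l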