zrepl status: intermittent socket connection errors
Summary
I’ve been having intermittent slowdowns of the daemon on the pull side; they seem to occur after the daemon has been running for several hours. The problem is noticeable in the status page, where errors about the Unix socket flash by, mentioning EOF or an unexpected trailer.
Once this starts occurring there is a significant slowdown, and after restarting the daemon normal performance returns.
System
Server:
- OS: FreeNAS-11.2-U7
- zrepl version=v0.2.0 GOOS=freebsd GOARCH=amd64 Compiler=gc
Client:
- OS: Arch Linux 5.4.10-arch1-1
- zrepl version=arch-0.2.1 GOOS=linux GOARCH=amd64 Compiler=gc
Configuration
Here’s my configuration on the pull-server side. I’m pulling from multiple hosts, but their configurations are all identical.
global:
  logging:
    - type: "stdout"
      level: "info"
      format: "human"

jobs:
  - name: wooly
    type: pull
    connect:
      type: tls
      address: "172.20.20.3:8888"
      ca: "/mnt/tank/system/root/etc/zrepl/wooly.ramsden.network.crt"
      cert: "/mnt/tank/system/root/etc/zrepl/lilan.ramsden.network.crt"
      key: "/mnt/tank/system/root/etc/zrepl/lilan.ramsden.network.key"
      server_cn: "wooly.ramsden.network"
    root_fs: "tank/repl/wooly"
    interval: 15m
    pruning:
      keep_sender:
        - type: grid
          grid: 1x1h(keep=all) | 24x1h | 30x1d
          regex: "zrepl_.*"
      keep_receiver:
        - type: grid
          grid: 1x1h(keep=all) | 24x1h | 60x1d | 24x30d
          regex: "zrepl_.*"
The client servers each have multiple datasets being replicated, around 18 or more.
About this issue
- State: closed
- Created 4 years ago
- Comments: 49 (47 by maintainers)
Commits related to this issue
- daemon/control: envconst timeout for control socket server-side timeouts refs #262 — committed to zrepl/zrepl by problame 4 years ago
- prevent transient zrepl status error: Post "http://unix/status": EOF. See the comment added to client.go in this commit. fixes https://github.com/zrepl/zrepl/issues/483 fixes https://github.com/zrepl... — committed to zrepl/zrepl by problame 2 years ago
In general: yes. Make sure you have sufficient disk space for the .json (watch one replication interval to gauge how much space it takes on your system).
Prometheus might be useful to gather general system stats, but not zrepl stats. So … I wouldn’t waste my time on standing up a Prometheus setup just to debug this.
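One way to follow that advice would be to watch the file grow across one 15-minute replication interval. A minimal sketch; the .json path here is purely hypothetical, since the question this comment answers is not shown above:

    # print the dump's size once a minute across one replication interval;
    # /var/tmp/zrepl-status.json is a hypothetical path, substitute your own
    while true; do du -h /var/tmp/zrepl-status.json; sleep 60; done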
Use zrepl pprof on 127.0.0.1:someport, then use pprof to find out why things are slow: https://blog.golang.org/pprof

I can help via voice call if you want; please shoot me an email or join #zrepl on freenode.
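Put together, the capture step could look like the sketch below. It assumes that zrepl's pprof listener exposes the standard Go net/http/pprof endpoints and that curl is available (FreeBSD's fetch works too); the port and output file names are arbitrary choices:

    # enable the daemon's pprof listener (the port is an arbitrary choice)
    zrepl pprof on 127.0.0.1:8080

    # while the slowdown is happening, grab a 30-second CPU profile and a
    # full goroutine dump (standard net/http/pprof paths, assumed here)
    curl -o cpu.pprof 'http://127.0.0.1:8080/debug/pprof/profile?seconds=30'
    curl -o goroutines.txt 'http://127.0.0.1:8080/debug/pprof/goroutine?debug=2'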
Sorry this took so long. I pushed a commit (it should show up above this comment) that enables configuration of read and write timeouts via environment variables. Could you try the CI binaries on your system & play around with the environment variables on the daemon where you observe the EOFs? The patch is based on the upcoming 0.3 release, which I invite you to test (you will have to build the docs yourself or download them from CI, too, as zrepl-noarch.tar).
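For illustration, starting the daemon with such timeout overrides could look like this. The variable names and values below are placeholders; check the envconst names in the commit referenced above ("daemon/control: envconst timeout for control socket server-side timeouts") for the real ones:

    # placeholder variable names: the actual envconst names are defined in
    # the commit above, these are only illustrative
    env ZREPL_DAEMON_CONTROL_SERVER_READ_TIMEOUT=10s \
        ZREPL_DAEMON_CONTROL_SERVER_WRITE_TIMEOUT=10s \
        zrepl daemon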
A good introduction: https://jvns.ca/blog/2017/09/24/profiling-go-with-pprof/

No, you don’t need the Go compiler, but no matter whether you run go tool pprof on your laptop or on FreeNAS directly, you need to pass it the zrepl binary that is executing, i.e., the one on FreeNAS.
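Concretely, the analysis step might look like this sketch; the binary path is a guess (confirm with which zrepl on the FreeNAS box), and cpu.pprof is the profile captured earlier:

    # the first argument must be the zrepl binary that is actually running
    # on the FreeNAS box; the path shown here is only a guess
    go tool pprof /usr/local/bin/zrepl cpu.pprof
    # at the interactive (pprof) prompt, 'top' and 'list <func>' are good
    # first looks at where the CPU time is going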