zrepl: Help debugging "tls: bad certificate" errors
My zrepl instance is failing with "tls: bad certificate" errors, and the logs don't say much more than that. How can I debug the problem to figure out what's wrong with my certs?
My setup is a single pull server with several pull jobs and 8 other servers with one source job apiece. They all use self-signed TLS certificates. The setup was working fine until I replaced the pull server: I transferred the existing destination dataset with zfs send/recv, transferred the zrepl.yml file, generated new certificates, and deployed them. Now the logs on the pull machine look like this:
2021-09-17T20:25:00.339795+00:00 PULLHOST zrepl[32410]: job=recv_SOURCEHOST subsystem=job span=Uv38$bWRc$bWRc.RRg7.L7R2.EwCY msg="start replication"
2021-09-17T20:25:00.339941+00:00 PULLHOST zrepl[32410]: job=recv_SOURCEHOST subsystem=repl span=Uv38$bWRc$bWRc.RRg7.L7R2.EwCY.saFp msg="start planning"
2021-09-17T20:25:00.645182+00:00 PULLHOST zrepl[32410]: job=recv_SOURCEHOST subsystem=rpc.ctrl span=Uv38$bWRc$bWRc.RRg7.L7R2 msg="cannot connect" err="could not decode protocol banner: error reading protocol banner length: remote error: tls: bad certificate"
2021-09-17T20:25:00.645217+00:00 PULLHOST zrepl[32409] 2021-09-17T20:25:00Z [ERRO][recv_SOURCEHOST][rpc.ctrl][Uv38$bWRc$bWRc.RRg7.L7R2]: cannot connect err="could not decode protocol banner: error reading protocol banner length: remote error: tls: bad certificate"
2021-09-17T20:25:00.645467+00:00 PULLHOST zrepl[32410]: job=recv_SOURCEHOST subsystem=repl span=Uv38$bWRc$bWRc.RRg7.L7R2.EwCY.saFp msg="error listing sender filesystems" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing could not decode protocol banner: error reading protocol banner length: remote error: tls: bad certificate\"" errType=*status.Error
2021-09-17T20:25:00.645508+00:00 PULLHOST zrepl[32409] 2021-09-17T20:25:00Z [ERRO][recv_SOURCEHOST][repl][Uv38$bWRc$bWRc.RRg7.L7R2.EwCY.saFp]: error listing sender filesystems err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing could not decode protocol banner: error reading protocol banner length: remote error: tls: bad certificate\"" errType="*status.Error"
2021-09-17T20:25:00.645658+00:00 PULLHOST zrepl[32409] 2021-09-17T20:25:00Z [ERRO][recv_SOURCEHOST][repl][Uv38$bWRc$bWRc.RRg7.L7R2.EwCY]: most recent error in this attempt err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing could not decode protocol banner: error reading protocol banner length: remote error: tls: bad certificate\"" attempt_number="0"
2021-09-17T20:25:00.645666+00:00 PULLHOST zrepl[32410]: job=recv_SOURCEHOST subsystem=repl span=Uv38$bWRc$bWRc.RRg7.L7R2.EwCY msg="most recent error in this attempt" attempt_number=0 err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing could not decode protocol banner: error reading protocol banner length: remote error: tls: bad certificate\""
2021-09-17T20:25:00.645743+00:00 PULLHOST zrepl[32409] 2021-09-17T20:25:00Z [ERRO][recv_SOURCEHOST][repl][Uv38$bWRc$bWRc.RRg7.L7R2.EwCY]: temporary connectivity-related error identified, start waiting for reconnect attempt_number="0" deadline="2021-09-17 20:35:00.645669946 +0000 UTC m=+1158.970501483"
2021-09-17T20:25:00.645765+00:00 PULLHOST zrepl[32410]: job=recv_SOURCEHOST subsystem=repl span=Uv38$bWRc$bWRc.RRg7.L7R2.EwCY msg="temporary connectivity-related error identified, start waiting for reconnect" attempt_number=0 deadline=2021-09-17T20:35:00.645669946Z
2021-09-17T20:25:00.951495+00:00 PULLHOST zrepl[32410]: job=recv_SOURCEHOST subsystem=rpc.data span=Uv38$bWRc$YW7d$YW7d.RLLY msg="ping failed" err="could not decode protocol banner: error reading protocol banner length: remote error: tls: bad certificate"
2021-09-17T20:25:00.951536+00:00 PULLHOST zrepl[32409] 2021-09-17T20:25:00Z [ERRO][recv_SOURCEHOST][rpc.data][Uv38$bWRc$YW7d$YW7d.RLLY]: ping failed err="could not decode protocol banner: error reading protocol banner length: remote error: tls: bad certificate"
2021-09-17T20:25:00.951599+00:00 PULLHOST zrepl[32409] 2021-09-17T20:25:00Z [ERRO][recv_SOURCEHOST][rpc.ctrl][Uv38$bWRc$YW7d$YW7d.RLLY]: ping failed err="rpc error: code = Canceled desc = latest balancer error: connection error: desc = \"transport: Error while dialing could not decode protocol banner: error reading protocol banner length: remote error: tls: bad certificate\""
2021-09-17T20:25:00.951627+00:00 PULLHOST zrepl[32410]: job=recv_SOURCEHOST subsystem=rpc.ctrl span=Uv38$bWRc$YW7d$YW7d.RLLY msg="ping failed" err="rpc error: code = Canceled desc = latest balancer error: connection error: desc = \"transport: Error while dialing could not decode protocol banner: error reading protocol banner length: remote error: tls: bad certificate\""
2021-09-17T20:25:00.951765+00:00 PULLHOST zrepl[32409] 2021-09-17T20:25:00Z [ERRO][recv_SOURCEHOST][repl][Uv38$bWRc$bWRc.RRg7.L7R2.EwCY]: reconnecting failed, aborting run attempt_number="0" err="sender is not reachable: control and data rpc failed to respond to ping rpcs"
2021-09-17T20:25:00.951778+00:00 PULLHOST zrepl[32410]: job=recv_SOURCEHOST subsystem=repl span=Uv38$bWRc$bWRc.RRg7.L7R2.EwCY msg="reconnecting failed, aborting run" attempt_number=0 err="sender is not reachable: control and data rpc failed to respond to ping rpcs"
2021-09-17T20:25:00.951993+00:00 PULLHOST zrepl[32410]: job=recv_SOURCEHOST subsystem=job span=Uv38$bWRc$bWRc.RRg7.L7R2.ynsf msg="start pruning sender"
2021-09-17T20:25:00.952195+00:00 PULLHOST zrepl[32410]: job=recv_SOURCEHOST subsystem=job span=Uv38$bWRc$bWRc.RRg7.L7R2.ynsf msg="finished pruning sender"
2021-09-17T20:25:00.952384+00:00 PULLHOST zrepl[32410]: job=recv_SOURCEHOST subsystem=job span=Uv38$bWRc$bWRc.RRg7.L7R2.jxWn msg="start pruning receiver"
2021-09-17T20:25:00.952533+00:00 PULLHOST zrepl[32410]: job=recv_SOURCEHOST subsystem=job span=Uv38$bWRc$bWRc.RRg7.L7R2.jxWn msg="finished pruning receiver"
2021-09-17T20:25:00.952713+00:00 PULLHOST zrepl[32410]: job=recv_SOURCEHOST subsystem=job span=Uv38$bWRc$bWRc.RRg7 msg="wait for wakeups"
And on the source hosts, like this:
2021-09-17T20:25:00.596100+00:00 SOURCEHOST zrepl[1905]: job=snap_and_serve_PULLHOST subsystem=transportmux span=8x/p$Wh6Z$Wh6Z.eh8S msg="accept error" err="tls: client didn't provide a certificate" errType=*errors.errorString
2021-09-17T20:25:00.596176+00:00 SOURCEHOST zrepl[1904] 2021-09-17T20:25:00Z [ERRO][snap_and_serve_PULLHOST][transportmux][8x/p$Wh6Z$Wh6Z.eh8S]: accept error errType="*errors.errorString" err="tls: client didn't provide a certificate"
2021-09-17T20:25:00.902540+00:00 SOURCEHOST zrepl[1905]: job=snap_and_serve_PULLHOST subsystem=transportmux span=8x/p$Wh6Z$Wh6Z.eh8S msg="accept error" err="tls: client didn't provide a certificate" errType=*errors.errorString
2021-09-17T20:25:00.902646+00:00 SOURCEHOST zrepl[1904] 2021-09-17T20:25:00Z [ERRO][snap_and_serve_PULLHOST][transportmux][8x/p$Wh6Z$Wh6Z.eh8S]: accept error err="tls: client didn't provide a certificate" errType="*errors.errorString"
About this issue
- State: closed
- Created 3 years ago
- Comments: 25 (13 by maintainers)
Commits related to this issue
- stderr debugging for #517 (tls: bad certificate) refs #517 cc @asomers — committed to zrepl/zrepl by problame 3 years ago
So that tls.permanentError is merely SSL alert code 42, "bad certificate" (https://techcommunity.microsoft.com/t5/iis-support-blog/ssl-tls-alert-protocol-and-the-alert-codes/ba-p/377132). So we'd probably need more debug logging on the source side as well, sorry about that.
I found https://github.com/etcd-io/etcd/issues/9785 in etcd (also written in Go), where the root cause was that the key usage fields in the TLS certs were incorrect.
Let me code up a patch that adds source-side stderr debugging so that we can find the root cause of this and have better log messages in the future. Please do not delete the certs you are currently using.