VictoriaMetrics: [vmauth]: a lot of TLS handshake errors
Describe the bug We have vmauth in front of vminsert. Now I observed that there are a lot of handshake errors like the following:
Oct 12 16:22:53 insert.example.com vmauth[3599771]: 2021-10-12T14:22:53.181Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48540: EOF
Oct 12 16:22:53 insert.example.com vmauth[3599771]: 2021-10-12T14:22:53.186Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48538: EOF
Oct 12 16:22:53 insert.example.com vmauth[3599771]: 2021-10-12T14:22:53.186Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48536: EOF
Oct 12 16:22:54 insert.example.com vmauth[3599771]: 2021-10-12T14:22:54.182Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48552: EOF
Oct 12 16:22:54 insert.example.com vmauth[3599771]: 2021-10-12T14:22:54.182Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48550: EOF
Oct 12 16:22:54 insert.example.com vmauth[3599771]: 2021-10-12T14:22:54.182Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48546: EOF
Oct 12 16:22:54 insert.example.com vmauth[3599771]: 2021-10-12T14:22:54.185Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48548: EOF
Oct 12 16:22:57 insert.example.com vmauth[3599771]: 2021-10-12T14:22:57.184Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48562: EOF
Oct 12 16:22:57 insert.example.com vmauth[3599771]: 2021-10-12T14:22:57.184Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48570: EOF
Oct 12 16:22:57 insert.example.com vmauth[3599771]: 2021-10-12T14:22:57.186Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48568: EOF
Oct 12 16:22:57 insert.example.com vmauth[3599771]: 2021-10-12T14:22:57.186Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48564: EOF
Oct 12 16:22:59 insert.example.com vmauth[3599771]: 2021-10-12T14:22:59.204Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48578: EOF
Oct 12 16:22:59 insert.example.com vmauth[3599771]: 2021-10-12T14:22:59.204Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48574: EOF
Oct 12 16:22:59 insert.example.com vmauth[3599771]: 2021-10-12T14:22:59.204Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48572: EOF
Oct 12 16:22:59 insert.example.com vmauth[3599771]: 2021-10-12T14:22:59.205Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48576: EOF
Oct 12 16:23:00 insert.example.com vmauth[3599771]: 2021-10-12T14:23:00.181Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48590: EOF
Oct 12 16:23:00 insert.example.com vmauth[3599771]: 2021-10-12T14:23:00.183Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48586: EOF
Oct 12 16:23:00 insert.example.com vmauth[3599771]: 2021-10-12T14:23:00.185Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48592: EOF
Oct 12 16:23:00 insert.example.com vmauth[3599771]: 2021-10-12T14:23:00.186Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48588: EOF
Oct 12 16:23:08 insert.example.com vmauth[3599771]: 2021-10-12T14:23:08.182Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48622: EOF
Oct 12 16:23:08 insert.example.com vmauth[3599771]: 2021-10-12T14:23:08.185Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48620: EOF
Oct 12 16:23:08 insert.example.com vmauth[3599771]: 2021-10-12T14:23:08.185Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48618: EOF
Oct 12 16:23:08 insert.example.com vmauth[3599771]: 2021-10-12T14:23:08.185Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48616: EOF
Oct 12 16:23:09 insert.example.com vmauth[3599771]: 2021-10-12T14:23:09.181Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48630: EOF
Oct 12 16:23:09 insert.example.com vmauth[3599771]: 2021-10-12T14:23:09.181Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48634: EOF
Oct 12 16:23:09 insert.example.com vmauth[3599771]: 2021-10-12T14:23:09.183Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48632: EOF
Oct 12 16:23:12 insert.example.com vmauth[3599771]: 2021-10-12T14:23:12.190Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48650: EOF
Oct 12 16:23:12 insert.example.com vmauth[3599771]: 2021-10-12T14:23:12.190Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48656: EOF
Oct 12 16:23:12 insert.example.com vmauth[3599771]: 2021-10-12T14:23:12.190Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48652: EOF
Oct 12 16:23:12 insert.example.com vmauth[3599771]: 2021-10-12T14:23:12.193Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48654: EOF
Oct 12 16:23:14 insert.example.com vmauth[3599771]: 2021-10-12T14:23:14.184Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48660: EOF
Oct 12 16:23:14 insert.example.com vmauth[3599771]: 2021-10-12T14:23:14.185Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48662: EOF
Oct 12 16:23:14 insert.example.com vmauth[3599771]: 2021-10-12T14:23:14.188Z error net/http/server.go:3157 http: TLS handshake error from 192.168.1.2:48664: EOF
The logs are coming from vmagent. Even if I set:
http_idleConnTimeout=700s
in vmauth nothing changes.
I also tried:
promscrape_disableKeepAlive=true
but nothing changed. Everything works fine and I see no errors in vmagent logs.
It is annoying because I would like to create an alert for vm_log_messages_total{level!="info"} and in the vm-cluster grafana dashboard I have a large error log rate.
About this issue
- Original URL
- State: open
- Created 3 years ago
- Reactions: 1
- Comments: 19
Commits related to this issue
- app/vmagent: reduce the probability of TLS handshake timeout when dialing the remote storage The following actions are taken: - Increase the TLS hashdshake timeout from 5 seconds to 10 seconds - Inc... — committed to VictoriaMetrics/VictoriaMetrics by valyala 2 years ago
- app/vmagent: reduce the probability of TLS handshake timeout when dialing the remote storage The following actions are taken: - Increase the TLS hashdshake timeout from 5 seconds to 10 seconds - Inc... — committed to VictoriaMetrics/VictoriaMetrics by valyala 2 years ago
- app/vmagent: properly initialize stdDialer This is a follow-up commit for 7da20a4b3fad761f6582fde6672c1a4138cd5e6d Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1699 — committed to VictoriaMetrics/VictoriaMetrics by valyala 2 years ago
- app/vmagent: properly initialize stdDialer This is a follow-up commit for 7da20a4b3fad761f6582fde6672c1a4138cd5e6d Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1699 — committed to VictoriaMetrics/VictoriaMetrics by valyala 2 years ago
It appears vmagent_remotewrite_conn_read_errors_total and vmagent_remotewrite_conn_write_errors_total are incremented whenever I see those EOF errors from the db. They are incremented at vmagent, while the errors come from the database.
The counters are described here: https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/app/vmagent/remotewrite/statconn.go
I will clone the repo and add some error messaging on vmagent’s side at statconn.go to see what the actual errors are that seem to coincide with the EOF errors from the database.
For reference, you can easily reproduce EOF error with telnet as well which is why it got me thinking that this could be result of initiating TCP conn to TLS listening server and then closing it right away: telnet (your victoria-metrics service https URL) (service port) (quit telnet right away) (service should report TLS handshake EOF error)
Closing the issue due to lack of activity.
@valyala, we upgraded all components (victoria metrics, vmagent and vmalert to v1.76.0 and it atill happens Same error appears every few seconds in victoria metrics log.
I fixed the errors by tweaking
--remoteWrite.queuesparameter on vmagents, lowering amount of queues used for remote writes. Each queue will make a connection to the remoteWrite target so I guess too many of them saturate the target and your TLS handshakes start timing out, which in turn creates those EOF errors on the target.I think creating TLS version of statDial would help hide the EOF log spam triggered by vmagents and instead log i/o timeouts to the metrics. Otherwise you simply need to tweak queue count settings to make vmagents less aggressive towards their targets.
I narrowed down the generation of EOF error to TLS Handshake timeout functionality. VMagent is using default Golang’s http transport TLS dialer.
In https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/app/vmagent/remotewrite/client.go#L94 there is a custom dialer for normal unencrypted TCP connection. There is no DialTLS option specified, so as per http.Transport documentation:
By that they mean to use
addTLSfunction: https://golang.org/src/net/http/transport.go#L1512 called in belowdialConnfunction to get TLS-secured connection. Specifically to distinguish between custom TLS dialer and default one you can see the logic in https://golang.org/src/net/http/transport.go#L1578 and if there is no custom TLS dialer defined then it will resort to using normal Dialer +addTLSfunction. Seems weird that they go with that route instead of initializing some default tls.Conn.So what caused those EOF errors on remoteWrite target? Since my previously published read/write errors of vmagent coincided with EOF errors from the remoteWrite targets, as well as those error stack-traces originating from
addTLS, then the best guess is that this specific section causes them: https://golang.org/src/net/http/transport.go#L1540 (TLS handshake timeout triggers abrupt connection closure inaddTLS).Potential fix? Not sure. @valyala I duplicated statConn as statConnTLS and made it internally use tls.Conn. The methods are the same as normal statConn, however, you must specify SetReadDeadline and SetWriteDeadline before calling tls.Conn.Read/Write respectively, they are used by tls.Conn’s internal write/read methods that first check/initiate TLS handshake before writing/reading request data.
The
statConnTLS.statDialTLSmethod is then set to transport’sDialTLSorDialTLSContextfield. One issue is that there doesn’t seem to be an easy way to pass TLSConfig object, so very likely you may want to supply an additional wrapper object that supplies DialTLS function and the dial function uses TLSConfig from the wrapper object to initiate tls.Conn.I quickly tried this with the following:
Even with nil TLS config, the EOF errors finally disappeared from remoteWrite target. If you log errors from statConnTLS write/read then I have noticed that there are occasional i/o timeouts but doesn’t seem to affect anything, seems to be just part of usual connection churn noise.
Anyways, I am not sure if creating a custom TLS statConn dialer is the solution just to silence seemingly harmless EOF TLS errors from vmagent remoteWrite targets. I am still puzzled why these random TLS handshake timeouts happen in the first place.
Thanks for your response. We are not scraping the vminsert/vmauth endpoint. We are using vmagent that uses a remote_write to vmauth, which secures vminsert on that node. If I change the remote_write endpoint to localhost, there are no errors.