VictoriaMetrics: [vmauth]: a lot of TLS handshake errors

Describe the bug We have vmauth in front of vminsert. Now I observed that there are a lot of handshake errors like the following:


Oct 12 16:22:53 insert.example.com vmauth[3599771]: 2021-10-12T14:22:53.181Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48540: EOF
Oct 12 16:22:53 insert.example.com vmauth[3599771]: 2021-10-12T14:22:53.186Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48538: EOF
Oct 12 16:22:53 insert.example.com vmauth[3599771]: 2021-10-12T14:22:53.186Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48536: EOF
Oct 12 16:22:54 insert.example.com vmauth[3599771]: 2021-10-12T14:22:54.182Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48552: EOF
Oct 12 16:22:54 insert.example.com vmauth[3599771]: 2021-10-12T14:22:54.182Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48550: EOF
Oct 12 16:22:54 insert.example.com vmauth[3599771]: 2021-10-12T14:22:54.182Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48546: EOF
Oct 12 16:22:54 insert.example.com vmauth[3599771]: 2021-10-12T14:22:54.185Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48548: EOF
Oct 12 16:22:57 insert.example.com vmauth[3599771]: 2021-10-12T14:22:57.184Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48562: EOF
Oct 12 16:22:57 insert.example.com vmauth[3599771]: 2021-10-12T14:22:57.184Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48570: EOF
Oct 12 16:22:57 insert.example.com vmauth[3599771]: 2021-10-12T14:22:57.186Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48568: EOF
Oct 12 16:22:57 insert.example.com vmauth[3599771]: 2021-10-12T14:22:57.186Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48564: EOF
Oct 12 16:22:59 insert.example.com vmauth[3599771]: 2021-10-12T14:22:59.204Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48578: EOF
Oct 12 16:22:59 insert.example.com vmauth[3599771]: 2021-10-12T14:22:59.204Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48574: EOF
Oct 12 16:22:59 insert.example.com vmauth[3599771]: 2021-10-12T14:22:59.204Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48572: EOF
Oct 12 16:22:59 insert.example.com vmauth[3599771]: 2021-10-12T14:22:59.205Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48576: EOF
Oct 12 16:23:00 insert.example.com vmauth[3599771]: 2021-10-12T14:23:00.181Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48590: EOF
Oct 12 16:23:00 insert.example.com vmauth[3599771]: 2021-10-12T14:23:00.183Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48586: EOF
Oct 12 16:23:00 insert.example.com vmauth[3599771]: 2021-10-12T14:23:00.185Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48592: EOF
Oct 12 16:23:00 insert.example.com vmauth[3599771]: 2021-10-12T14:23:00.186Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48588: EOF
Oct 12 16:23:08 insert.example.com vmauth[3599771]: 2021-10-12T14:23:08.182Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48622: EOF
Oct 12 16:23:08 insert.example.com vmauth[3599771]: 2021-10-12T14:23:08.185Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48620: EOF
Oct 12 16:23:08 insert.example.com vmauth[3599771]: 2021-10-12T14:23:08.185Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48618: EOF
Oct 12 16:23:08 insert.example.com vmauth[3599771]: 2021-10-12T14:23:08.185Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48616: EOF
Oct 12 16:23:09 insert.example.com vmauth[3599771]: 2021-10-12T14:23:09.181Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48630: EOF
Oct 12 16:23:09 insert.example.com vmauth[3599771]: 2021-10-12T14:23:09.181Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48634: EOF
Oct 12 16:23:09 insert.example.com vmauth[3599771]: 2021-10-12T14:23:09.183Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48632: EOF
Oct 12 16:23:12 insert.example.com vmauth[3599771]: 2021-10-12T14:23:12.190Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48650: EOF
Oct 12 16:23:12 insert.example.com vmauth[3599771]: 2021-10-12T14:23:12.190Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48656: EOF
Oct 12 16:23:12 insert.example.com vmauth[3599771]: 2021-10-12T14:23:12.190Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48652: EOF
Oct 12 16:23:12 insert.example.com vmauth[3599771]: 2021-10-12T14:23:12.193Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48654: EOF
Oct 12 16:23:14 insert.example.com vmauth[3599771]: 2021-10-12T14:23:14.184Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48660: EOF
Oct 12 16:23:14 insert.example.com vmauth[3599771]: 2021-10-12T14:23:14.185Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48662: EOF
Oct 12 16:23:14 insert.example.com vmauth[3599771]: 2021-10-12T14:23:14.188Z        error        net/http/server.go:3157        http: TLS handshake error from 192.168.1.2:48664: EOF

The logs are coming from vmagent. Even if I set:

http_idleConnTimeout=700s

in vmauth nothing changes.

I also tried:

promscrape_disableKeepAlive=true

but nothing changed. Everything works fine and I see no errors in vmagent logs.

It is annoying because I would like to create an alert for vm_log_messages_total{level!="info"} and in the vm-cluster grafana dashboard I have a large error log rate.

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Reactions: 1
  • Comments: 19

Commits related to this issue

Most upvoted comments

It appears vmagent_remotewrite_conn_read_errors_total and vmagent_remotewrite_conn_write_errors_total are incremented whenever I see those EOF errors from the db. They are incremented at vmagent, while the errors come from the database.

The counters are described here: https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/app/vmagent/remotewrite/statconn.go

I will clone the repo and add some error messaging on vmagent’s side at statconn.go to see what the actual errors are that seem to coincide with the EOF errors from the database.

For reference, you can easily reproduce EOF error with telnet as well which is why it got me thinking that this could be result of initiating TCP conn to TLS listening server and then closing it right away: telnet (your victoria-metrics service https URL) (service port) (quit telnet right away) (service should report TLS handshake EOF error)

Closing the issue due to lack of activity.

@valyala, we upgraded all components (victoria metrics, vmagent and vmalert to v1.76.0 and it atill happens Same error appears every few seconds in victoria metrics log.

I fixed the errors by tweaking --remoteWrite.queues parameter on vmagents, lowering amount of queues used for remote writes. Each queue will make a connection to the remoteWrite target so I guess too many of them saturate the target and your TLS handshakes start timing out, which in turn creates those EOF errors on the target.

I think creating TLS version of statDial would help hide the EOF log spam triggered by vmagents and instead log i/o timeouts to the metrics. Otherwise you simply need to tweak queue count settings to make vmagents less aggressive towards their targets.

I narrowed down the generation of EOF error to TLS Handshake timeout functionality. VMagent is using default Golang’s http transport TLS dialer.

In https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/app/vmagent/remotewrite/client.go#L94 there is a custom dialer for normal unencrypted TCP connection. There is no DialTLS option specified, so as per http.Transport documentation:

// DialTLSContext specifies an optional dial function for creating
// TLS connections for non-proxied HTTPS requests.
//
// If DialTLSContext is nil (and the deprecated DialTLS below is also nil),
// DialContext and TLSClientConfig are used.
//
// If DialTLSContext is set, the Dial and DialContext hooks are not used for HTTPS
// requests and the TLSClientConfig and TLSHandshakeTimeout
// are ignored. The returned net.Conn is assumed to already be
// past the TLS handshake.
DialTLSContext func(ctx context.Context, network, addr string) (net.Conn, error)

By that they mean to use addTLS function: https://golang.org/src/net/http/transport.go#L1512 called in below dialConn function to get TLS-secured connection. Specifically to distinguish between custom TLS dialer and default one you can see the logic in https://golang.org/src/net/http/transport.go#L1578 and if there is no custom TLS dialer defined then it will resort to using normal Dialer + addTLS function. Seems weird that they go with that route instead of initializing some default tls.Conn.

So what caused those EOF errors on remoteWrite target? Since my previously published read/write errors of vmagent coincided with EOF errors from the remoteWrite targets, as well as those error stack-traces originating from addTLS, then the best guess is that this specific section causes them: https://golang.org/src/net/http/transport.go#L1540 (TLS handshake timeout triggers abrupt connection closure in addTLS).

Potential fix? Not sure. @valyala I duplicated statConn as statConnTLS and made it internally use tls.Conn. The methods are the same as normal statConn, however, you must specify SetReadDeadline and SetWriteDeadline before calling tls.Conn.Read/Write respectively, they are used by tls.Conn’s internal write/read methods that first check/initiate TLS handshake before writing/reading request data.

The statConnTLS.statDialTLS method is then set to transport’s DialTLS or DialTLSContext field. One issue is that there doesn’t seem to be an easy way to pass TLSConfig object, so very likely you may want to supply an additional wrapper object that supplies DialTLS function and the dial function uses TLSConfig from the wrapper object to initiate tls.Conn.

I quickly tried this with the following:

func statDialTLS(ctx context.Context, network string, addr string) (conn net.Conn, err error) {
	network = netutil.GetTCPNetwork()
	tlsConn, tlsConnErr := tls.Dial(network, addr, nil)
	dialsTotal.Inc()
	if tlsConnErr != nil {
		dialErrors.Inc()
		logger.Errorf("statDialTLS error: %s", tlsConnErr)
		return nil, tlsConnErr
	}
	scTLS := &statConnTLS{
		Conn: tlsConn,
	}
	return scTLS, nil
}

Even with nil TLS config, the EOF errors finally disappeared from remoteWrite target. If you log errors from statConnTLS write/read then I have noticed that there are occasional i/o timeouts but doesn’t seem to affect anything, seems to be just part of usual connection churn noise.

Anyways, I am not sure if creating a custom TLS statConn dialer is the solution just to silence seemingly harmless EOF TLS errors from vmagent remoteWrite targets. I am still puzzled why these random TLS handshake timeouts happen in the first place.

Thanks for your response. We are not scraping the vminsert/vmauth endpoint. We are using vmagent that uses a remote_write to vmauth, which secures vminsert on that node. If I change the remote_write endpoint to localhost, there are no errors.