telegraf: Metric corruption when using http2 nginx reverse proxy with influxdb output
Bug report
Telegraf stopped working behind an nginx reverse proxy since the 1.3 release.
Relevant telegraf.conf:
# Telegraf configuration
[tags]
# Configuration for telegraf agent
[agent]
interval = "10s"
debug = false
hostname = "my.host.fqdn"
round_interval = true
flush_interval = "10s"
flush_jitter = "0s"
collection_jitter = "0s"
metric_batch_size = 1000
metric_buffer_limit = 10000
quiet = false
[[outputs.influxdb]]
urls = ["https://host:8086"]
database = "telegraf_metrics"
username = "telegraf_metrics"
password = "pass"
retention_policy = ""
write_consistency = "any"
timeout = "5s"
[[inputs.cpu]]
percpu = true
totalcpu = true
fielddrop = ["time_*"]
[[inputs.disk]]
ignore_fs = ["tmpfs", "devtmpfs"]
[[inputs.diskio]]
[[inputs.mem]]
[[inputs.system]]
[[inputs.swap]]
[[inputs.internal]]
[[inputs.kernel]]
[[inputs.processes]]
[[inputs.interrupts]]
[[inputs.linux_sysctl_fs]]
[[inputs.kernel_vmstat]]
[[inputs.net]]
[[inputs.netstat]]
[[inputs.nstat]]
dump_zeros = true
[[inputs.conntrack]]
files = ["ip_conntrack_count","ip_conntrack_max", "nf_conntrack_count","nf_conntrack_max"]
dirs = ["/proc/sys/net/ipv4/netfilter","/proc/sys/net/netfilter"]
System info:
Telegraf 1.3.0-1, InfluxDB 1.2.4-1, nginx 1.13.0-1~xenial
Steps to reproduce:
- Install telegraf, influxdb and nginx.
- Set up a simple reverse proxy ( location / { proxy_pass http://influxdb:8086; } ).
- Start sending metrics from telegraf (telegraf -> nginx -> influxdb).
- From this point on, influxdb fills up with broken, unreadable metrics. They look like partially written metrics, for example ???st=some.of.host.i.monitor or ???sk=sda1.
Expected behavior:
Metrics should arrive properly.
Actual behavior:
Lots of broken metrics with strange unicode (?) symbols.
Additional info:
Sometimes I saw lines like “2017-05-21T03:33:10Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse ‘.ru timst=rails-2 total=4387237888i,used=2909868032i,free=1477369856i,used_percent=66.32574084845248 1495337590000000000’: invalid boolean]”
I suspect this is somehow related to https://github.com/influxdata/telegraf/pull/2251
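For anyone triaging similar output, here is a rough Python sketch of a structural check that separates intact line-protocol lines from truncated ones like the one in the error above. It is deliberately simplified and is not InfluxDB's parser: the helper names (split_unescaped, looks_like_line_protocol) are made up for illustration, quoting is treated loosely, and the "good" sample line with measurement mem and tag host is an assumed example, not taken from the report.

```python
def split_unescaped(text, sep):
    """Split on sep, skipping separators that are backslash-escaped
    or that sit inside a double-quoted string."""
    parts, buf, in_quotes, i = [], [], False, 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Keep the escape sequence intact and skip over it.
            buf.append(text[i:i + 2])
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes
        if ch == sep and not in_quotes:
            parts.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
        i += 1
    parts.append("".join(buf))
    return parts


def looks_like_line_protocol(line):
    """Return True if the line has the shape
    measurement[,tag=value...] field=value[,field=value...] [timestamp]."""
    sections = split_unescaped(line.strip(), " ")
    if len(sections) not in (2, 3):
        return False
    key, fields = sections[0], sections[1]
    if not key or key.startswith(","):
        return False
    # Everything after the measurement name must be tag=value;
    # every field must be field=value.
    pairs = split_unescaped(key, ",")[1:] + split_unescaped(fields, ",")
    for pair in pairs:
        kv = split_unescaped(pair, "=")
        if len(kv) != 2 or "" in kv:
            return False
    # An optional timestamp must be an integer.
    if len(sections) == 3 and not sections[2].lstrip("-").isdigit():
        return False
    return True


# The fragment from the error message above (truncated at the front).
broken = (".ru timst=rails-2 total=4387237888i,used=2909868032i,"
          "free=1477369856i,used_percent=66.32574084845248 1495337590000000000")
# An assumed well-formed line for comparison.
good = "mem,host=rails-2 total=4387237888i,used=2909868032i 1495337590000000000"

print(looks_like_line_protocol(broken), looks_like_line_protocol(good))
```

The broken fragment fails because it has four space-separated sections instead of two or three, which is consistent with the front of the line having been cut off mid-tag.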
About this issue
- State: closed
- Created 7 years ago
- Reactions: 4
- Comments: 53 (27 by maintainers)
@scanterog has been testing with
content_encoding = "gzip"
for about a day without corruption, so I think it is the Content-Length. I will get it fixed.
I’m looking through the code again, and I think it may be possible for the Content-Length header to be incorrect. Can someone, if it has not already been done, test with the
content_encoding = "gzip"
option? This would disable the Content-Length header and use chunked transfer encoding.
Does setting the following environment variable at telegraf runtime work around the issue?
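On the Content-Length hypothesis discussed above, here is a minimal Python sketch of the two HTTP/1.1 body framings, for illustration only. It does not model HTTP/2, nginx, or Telegraf's actual code. The function names are made up, chunk extensions and trailers are ignored, and the sample metric line is an assumption.

```python
def encode_chunked(body, chunk_size=8):
    """Frame a body with HTTP/1.1 chunked transfer encoding:
    <hex length>CRLF<data>CRLF per chunk, then a zero-length chunk."""
    out = bytearray()
    for start in range(0, len(body), chunk_size):
        chunk = body[start:start + chunk_size]
        out += b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    out += b"0\r\n\r\n"
    return bytes(out)


def decode_chunked(data):
    """Reassemble a chunked body (no chunk extensions or trailers)."""
    out = bytearray()
    pos = 0
    while True:
        eol = data.index(b"\r\n", pos)
        size = int(data[pos:eol], 16)
        if size == 0:
            return bytes(out)
        start = eol + 2
        out += data[start:start + size]
        pos = start + size + 2  # skip the CRLF that ends the chunk


def read_fixed_length(payload, declared_length):
    """What a receiver that trusts Content-Length takes as the body."""
    return payload[:declared_length]


body = b"cpu,host=a usage=1 1\n"

# Chunked framing is self-delimiting, so the body round-trips intact.
print(decode_chunked(encode_chunked(body, 5)) == body)

# A Content-Length that is too small cuts the metric mid-line.
print(read_fixed_length(body, len(body) - 4))
```

With chunked framing the sender never has to state the total length up front, so a miscounted length cannot truncate the body. With a wrong Content-Length, the receiver either drops the tail of the batch or treats leftover bytes as something else, which would look like the partial metrics reported here.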
It works if I take nginx out, which is how it is running right now. It also works with the reverse proxy if I downgrade telegraf back to 1.2.1. I also observed the problem with a self-built version (looking at my clone of the repo, that was c66e2896c658cc2d3ccf9fdec3be8072e87162a2), but that was on a test box and I forgot to file an issue for it.
When upgrading to 1.3 I mostly left the config intact, except for adding 2 new inputs:
Locale output:
Telegraf envs, ubuntu 16.04 LTS:
Debian 8:
Here are the full nginx configs. nginx.conf:
Corresponding vhost:
Nginx also auto-includes the following settings (in the http {} context). Gzip:
Open file cache:
SSL settings: