telegraf: Metric corruption when using http2 nginx reverse proxy with influxdb output

Bug report

Telegraf stopped working behind an nginx reverse proxy as of the 1.3 release.

Relevant telegraf.conf:

# Telegraf configuration

[tags]

# Configuration for telegraf agent
[agent]
    interval = "10s"
    debug = false
    hostname = "my.host.fqdn"
    round_interval = true
    flush_interval = "10s"
    flush_jitter = "0s"
    collection_jitter = "0s"
    metric_batch_size = 1000
    metric_buffer_limit = 10000
    quiet = false

[[outputs.influxdb]]
    urls = ["https://host:8086"]
    database = "telegraf_metrics"
    username = "telegraf_metrics"
    password = "pass"
    retention_policy = ""
    write_consistency = "any"
    timeout = "5s"

[[inputs.cpu]]
    percpu = true
    totalcpu = true
    fielddrop = ["time_*"]
[[inputs.disk]]
    ignore_fs = ["tmpfs", "devtmpfs"]
[[inputs.diskio]]
[[inputs.mem]]
[[inputs.system]]
[[inputs.swap]]
[[inputs.internal]]
[[inputs.kernel]]
[[inputs.processes]]
[[inputs.interrupts]]
[[inputs.linux_sysctl_fs]]
[[inputs.kernel_vmstat]]
[[inputs.net]]
[[inputs.netstat]]
[[inputs.nstat]]
    dump_zeros = true
[[inputs.conntrack]]
    files = ["ip_conntrack_count","ip_conntrack_max", "nf_conntrack_count","nf_conntrack_max"]
    dirs = ["/proc/sys/net/ipv4/netfilter","/proc/sys/net/netfilter"]

System info:

Telegraf 1.3.0-1, InfluxDB 1.2.4-1 , nginx 1.13.0-1~xenial

Steps to reproduce:

Install telegraf, influxdb and nginx
Set up a simple reverse proxy ( location / { proxy_pass http://influxdb:8086; } )
Start pumping metrics from telegraf (telegraf -> nginx -> influxdb)
From this point on, influxdb fills up with broken, unreadable metrics. They look like partially written metrics, for example ???st=some.of.host.i.monitor or ???sk=sda1.

Expected behavior:

Metrics should arrive properly.

Actual behavior:

Lots of broken metrics with strange unicode (?) symbols.

Additional info:

Sometimes I saw log lines like “2017-05-21T03:33:10Z E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: unable to parse '.ru timst=rails-2 total=4387237888i,used=2909868032i,free=1477369856i,used_percent=66.32574084845248 1495337590000000000': invalid boolean]”
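
One plausible reading of the "invalid boolean" part of that error: once the start of a line is lost, a fragment such as timst=rails-2 lands where the field set should be, and an unquoted value that is neither a number nor a boolean literal cannot be parsed. The sketch below is a deliberately naive model of line protocol field values (no escape handling, no tag parsing); the function names classify_field_value and first_invalid_field are hypothetical and are not part of telegraf or InfluxDB.

```python
import re

# Boolean literals accepted by InfluxDB 1.x line protocol.
BOOLEANS = {"t", "T", "true", "True", "TRUE", "f", "F", "false", "False", "FALSE"}
INTEGER = re.compile(r"^[+-]?\d+i$")
FLOAT = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$")


def classify_field_value(value):
    """Return a rough type name for one field value, or "invalid"."""
    if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
        return "string"
    if value in BOOLEANS:
        return "boolean"
    if INTEGER.match(value):
        return "integer"
    if FLOAT.match(value):
        return "float"
    return "invalid"


def first_invalid_field(line):
    """Return (key, value) of the first unparseable field, or None.

    Assumes the naive layout "<measurement[,tags]> <fields> [timestamp]"
    with no escaped spaces or commas.
    """
    parts = line.split(" ")
    if len(parts) < 2:
        return None
    for pair in parts[1].split(","):
        key, _, value = pair.partition("=")
        if classify_field_value(value) == "invalid":
            return key, value
    return None


if __name__ == "__main__":
    corrupted = (".ru timst=rails-2 total=4387237888i,used=2909868032i,"
                 "free=1477369856i,used_percent=66.32574084845248 "
                 "1495337590000000000")
    print(first_invalid_field(corrupted))
```

Under these assumptions the truncated line is rejected at the rails-2 value, which is consistent with the server reporting a parse failure rather than silently storing the point.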

I suspect this is somehow related to https://github.com/influxdata/telegraf/pull/2251

About this issue

State: closed
Created 7 years ago
Reactions: 4
Comments: 53 (27 by maintainers)

Most upvoted comments

@scanterog has been testing with content_encoding = "gzip" for about a day without corruption, so I think the Content-Length header is the cause. I will get it fixed.

danielnelson on Sep 29, 2017

I’m looking through the code again, and I think it may be possible for the Content-Length header to be incorrect. Can someone, if it has not already been done, test with the content_encoding = "gzip" option? This would disable the Content-Length header and use chunked transfer encoding.

danielnelson on Sep 15, 2017
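
The difference between Content-Length framing and chunked transfer encoding mentioned above can be sketched as follows. This is a simplified model for illustration only, not telegraf's or nginx's actual code: the helper names (frame_with_content_length, frame_chunked, read_with_declared_length) and the sample metrics are hypothetical. It shows that a chunked body carries its own length per chunk, whereas a receiver trusting a wrong Content-Length would cut the line protocol payload mid-line and leave a tail that starts in the middle of a tag key.

```python
def frame_with_content_length(body):
    """Frame a body with a single up-front length, as Content-Length does."""
    return b"Content-Length: %d\r\n\r\n" % len(body) + body


def frame_chunked(body, chunk_size=16):
    """Frame a body as HTTP/1.1 chunks: hex size, CRLF, data, CRLF, then a 0 chunk."""
    out = [b"Transfer-Encoding: chunked\r\n\r\n"]
    for i in range(0, len(body), chunk_size):
        chunk = body[i:i + chunk_size]
        out.append(b"%x\r\n" % len(chunk) + chunk + b"\r\n")
    out.append(b"0\r\n\r\n")  # terminating zero-length chunk
    return b"".join(out)


def read_with_declared_length(body, declared):
    """Model a receiver that trusts `declared` even if it is wrong.

    Returns (bytes treated as this request, leftover bytes that would be
    misread as the start of whatever comes next).
    """
    return body[:declared], body[declared:]


if __name__ == "__main__":
    # Hypothetical two-line payload in line protocol format.
    payload = b"cpu,host=a usage=1\ndisk,disk=sda1 free=2\n"
    print(frame_with_content_length(payload))
    print(frame_chunked(payload))
    # A declared length that is too short splits the second line mid-key.
    head, rest = read_with_declared_length(payload, 26)
    print(head)
    print(rest)
```

With a correct length both framings deliver the same bytes; the sketch only illustrates why an incorrect Content-Length is a plausible source of partially written metrics, and why switching to chunked encoding would sidestep it.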

Does setting the following environment variable at telegraf runtime work around the issue?

GODEBUG=http2client=0

bobmshannon on Aug 13, 2017

It works if I take nginx out, which is how it runs right now. It also works through the reverse proxy if I downgrade telegraf back to 1.2.1. I also observed the problem with a self-built version (judging by my clone of the repo, that was c66e2896c658cc2d3ccf9fdec3be8072e87162a2), but that was on a test box and I forgot to file an issue for it.

When upgrading to 1.3 I mostly left the config intact, except for adding 2 new inputs:

[[inputs.interrupts]]
[[inputs.linux_sysctl_fs]]

Locale output:

root@heimdall:~# locale
LANG=en_US.utf8
LANGUAGE=
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=

Telegraf environment, Ubuntu 16.04 LTS:

root@heimdall:~# cat /proc/22480/environ
LANG=en_US.utf8
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOME=/etc/telegraf
LOGNAME=telegraf
USER=telegraf
SHELL=/bin/false

Debian 8:

LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOME=/etc/telegraf
LOGNAME=telegraf
USER=telegraf
SHELL=/bin/false

Here are the full nginx configs. nginx.conf:

#Ansible managed, do not touch
user nginx  nginx;

worker_processes 4;

pid /var/run/nginx.pid;

worker_rlimit_nofile 1024;

include /etc/nginx/modules-enabled/*.conf;


events {
        worker_connections 512;
        use epoll;
        multi_accept on;
}


http {
        include /etc/nginx/mime.types;
        default_type application/octet-stream;
        sendfile on;
        tcp_nopush on;
        tcp_nodelay on;
        keepalive_timeout 65;
        access_log /var/log/nginx/access.log;
        error_log /var/log/nginx/error.log error;
        server_tokens off;
        types_hash_max_size 2048;

        include /etc/nginx/conf.d/*.conf;
        include /etc/nginx/sites-enabled/*;
}

Corresponding vhost:

#Ansible managed, do not touch
server {
   listen snip:80;
   listen snip:443 ssl http2;
   listen [snip]:443 ssl http2;
   listen [snip]:80;
   server_name snip;
   ssl_certificate /etc/nginx/ssl/snip/fullchain.pem;
   ssl_certificate_key /etc/nginx/ssl/snip/privkey.pem;
   ssl_dhparam /etc/nginx/ssl/dhparam.pem;
   client_max_body_size 10M;
   if ($scheme = http) {
       return 301 https://$server_name$request_uri;

   }
   location /.well-known/acme-challenge/ {
       alias /var/www/letsencrypt/;

   }
   location / {
       proxy_pass http://localhost:3000;
       proxy_set_header Host $host;
       proxy_set_header X-Real-IP $remote_addr;
       proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
       proxy_set_header X-Forwarded-Proto $scheme;

   }
}

Nginx also auto-includes the following settings (in the http {} context). Gzip:

#Ansible managed, do not touch
gzip on;
gzip_comp_level 5;
gzip_min_length 256;
gzip_http_version 1.1;
gzip_buffers 16 8k;
gzip_proxied any;
gzip_vary on;
gzip_disable "MSIE [1-6]\.(?!.*SV1)";
gzip_types application/atom+xml application/javascript application/json application/rss+xml application/vnd.ms-fontobject application/x-font-ttf application/x-javascript application/x-web-app-manifest+json application/xhtml+xml application/xml font/opentype image/svg+xml image/x-icon text/css text/plain text/x-component text/javascript;

Open file cache:

#Ansible managed, do not touch
open_file_cache max=200000 inactive=300s;
open_file_cache_valid 300s;
open_file_cache_min_uses 2;
open_file_cache_errors on;

SSL settings:

#Ansible managed, do not touch
ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
ssl_prefer_server_ciphers on;
ssl_ciphers "EECDH+AESGCM:EDH+AESGCM:AES256+EECDH:AES256+EDH";
ssl_ecdh_curve secp384r1;
ssl_session_cache shared:SSL:10m;
ssl_session_tickets off;
ssl_stapling on;
ssl_stapling_verify on;
add_header Strict-Transport-Security "max-age=63072000";

rlex on May 29, 2017