datadog-agent: Agent v6.5.2 broken logs from docker

Output of the info page (if this is a bug)


==============
Agent (v6.5.2)
==============

  Status date: 2018-09-28 09:34:53.577571 UTC
  Pid: 808
  Python Version: 2.7.15
  Logs: 
  Check Runners: 4
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: -346µs
    System UTC time: 2018-09-28 09:34:53.577571 UTC

  Host Info
  =========
    bootTime: 2018-09-27 19:59:30.000000 UTC
    kernelVersion: 3.10.0-693.2.2.el7.x86_64
    os: linux
    platform: centos
    platformFamily: rhel
    platformVersion: 7.4.1708
    procs: 148
    uptime: 11s

  Hostnames
  =========
    hostname: k***s.com
    socket-fqdn: h***2.hostwindsdns.com.
    socket-hostname: h***2.hostwindsdns.com
    hostname provider: configuration

=========
Collector
=========

  Running Checks
  ==============
    
    cpu
    ---
        Instance ID: cpu [OK]
        Total Runs: 3,261
        Metric Samples: 6, Total: 19,560
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s
        
    
    disk (1.3.0)
    ------------
        Instance ID: disk:e5dffb8bef24336f [OK]
        Total Runs: 3,261
        Metric Samples: 58, Total: 175,170
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 40ms
        
    
    file_handle
    -----------
        Instance ID: file_handle [OK]
        Total Runs: 3,260
        Metric Samples: 5, Total: 16,300
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s
        
    
    io
    --
        Instance ID: io [OK]
        Total Runs: 3,261
        Metric Samples: 26, Total: 84,768
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 3ms
        
    
    load
    ----
        Instance ID: load [OK]
        Total Runs: 3,260
        Metric Samples: 6, Total: 19,560
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s
        
    
    memory
    ------
        Instance ID: memory [OK]
        Total Runs: 3,261
        Metric Samples: 17, Total: 55,437
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s
        
    
    network (1.6.1)
    ---------------
        Instance ID: network:2a218184ebe03606 [OK]
        Total Runs: 3,261
        Metric Samples: 32, Total: 104,340
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 1ms
        
    
    ntp
    ---
        Instance ID: ntp:b4579e02d1981c12 [OK]
        Total Runs: 3,261
        Metric Samples: 1, Total: 3,261
        Events: 0, Total: 0
        Service Checks: 1, Total: 3,261
        Average Execution Time : 31ms
        
    
    uptime
    ------
        Instance ID: uptime [OK]
        Total Runs: 3,261
        Metric Samples: 1, Total: 3,261
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s
        
========
JMXFetch
========

  Initialized checks
  ==================
    no checks
    
  Failed checks
  =============
    no checks
    
=========
Forwarder
=========

  CheckRunsV1: 3,260
  Dropped: 0
  DroppedOnInput: 0
  Errors: 81
  Events: 0
  HostMetadata: 0
  IntakeV1: 249
  Metadata: 0
  Requeued: 87
  Retried: 82
  RetryQueueSize: 0
  Series: 0
  ServiceChecks: 0
  SketchSeries: 0
  Success: 6,769
  TimeseriesV1: 3,260

  API Keys status
  ===============
    API key ending in 24fb5 for endpoint https://app.datadoghq.com: API Key valid

==========
Logs Agent
==========

  custom
  ------
    Type: docker
    Name: front-container
    Status: Pending
  
=========
DogStatsD
=========

  Checks Metric Sample: 533,936
  Event: 1
  Events Flushed: 1
  Number Of Flushes: 3,260
  Series Flushed: 420,505
  Service Check: 29,401
  Service Checks Flushed: 32,652


Describe what happened: The Logs Agent does not collect logs; the container source just stays Pending.

Describe what you expected: Agent v6.4.2 works as expected (installed on the same server on 4/9/2018 at 13:08):

  custom
  ------
    Type: docker
    Name: front-container
    Status: OK
    Inputs: 99b00c7d6467b686ce83333dfb86e5297cd20cd1810b99e0ac32dd218cadade1 

Steps to reproduce the issue:

  • Agent v6.4.2, Docker version 18.06.1-ce, build e68fc7a: working
  • Agent v6.5.2, Docker version 18.06.1-ce, build e68fc7a: not working

/etc/datadog-agent/conf.d/custom.yaml

logs:
  - type: docker
    name: front-container
    source: nginx
    service: docker

datadog.yaml

dd_url: https://app.datadoghq.com
api_key: a***5
hostname: k***s.com

tags:
  - role:shop-front

logs_enabled: true

Additional environment details (Operating System, Cloud provider, etc): Centos 7, Docker version 18.06.1-ce
Digital Ocean/Hostwindsdns - same problem

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 5
  • Comments: 21 (7 by maintainers)

Most upvoted comments

Also seeing a similar problem with v6.5.2. We are using Docker labels on our app containers to configure the Datadog container agent on hosts running Docker version 18.06.1-ce. It seems like the agent has a problem parsing container labels, which did not change when we upgraded Datadog.

[ AGENT ] 2018-09-27 22:54:20 UTC | INFO | (file.go:70 in Collect) | File: searching for configuration files at: /etc/datadog-agent/conf.d
[ AGENT ] 2018-09-27 22:54:20 UTC | INFO | (tailer.go:86 in Start) | Start tailing container: e***1
[ AGENT ] 2018-09-27 22:54:20 UTC | INFO | (file.go:70 in Collect) | File: searching for configuration files at: /opt/datadog-agent/bin/agent/dist/conf.d
[ AGENT ] 2018-09-27 22:54:20 UTC | WARN | (file.go:74 in Collect) | Skipping, open /opt/datadog-agent/bin/agent/dist/conf.d: no such file or directory
[ AGENT ] 2018-09-27 22:54:20 UTC | INFO | (file.go:70 in Collect) | File: searching for configuration files at:
[ AGENT ] 2018-09-27 22:54:20 UTC | WARN | (file.go:74 in Collect) | Skipping, open : no such file or directory
[ AGENT ] 2018-09-27 22:54:20 UTC | ERROR | (docker.go:126 in parseDockerLabels) | Can't parse template for container c***9: missing instances key
[ AGENT ] 2018-09-27 22:54:20 UTC | ERROR | (docker.go:126 in parseDockerLabels) | Can't parse template for container d***f: missing instances key
[ AGENT ] 2018-09-27 22:54:20 UTC | ERROR | (docker.go:126 in parseDockerLabels) | Can't parse template for container 8***5: missing instances key
[ AGENT ] 2018-09-27 22:54:20 UTC | WARN | (check.go:250 in Configure) | could not get a check instance with the new api: __init__() takes at least 4 arguments (4 given)
[ AGENT ] 2018-09-27 22:54:20 UTC | WARN | (check.go:251 in Configure) | trying to instantiate the check with the old api, passing agentConfig to the constructor
[ AGENT ] 2018-09-27 22:54:20 UTC | WARN | (check.go:276 in Configure) | passing `agentConfig` to the constructor is deprecated, please use the `get_config` function from the 'datadog_agent' package (disk).
[ AGENT ] 2018-09-27 22:54:20 UTC | WARN | (check.go:250 in Configure) | could not get a check instance with the new api: __init__() takes at least 4 arguments (4 given)
[ AGENT ] 2018-09-27 22:54:20 UTC | WARN | (check.go:251 in Configure) | trying to instantiate the check with the old api, passing agentConfig to the constructor
[ AGENT ] 2018-09-27 22:54:20 UTC | WARN | (check.go:276 in Configure) | passing `agentConfig` to the constructor is deprecated, please use the `get_config` function from the 'datadog_agent' package (network).

2018/10/05 additional info:

We are running our agent containers with the following environment variables:

DD_API_KEY=8***d
DD_DOGSTATSD_NON_LOCAL_TRAFFIC=true  # to collect statsd from containers on host
DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL=true
DD_LOGS_ENABLED=true
SD_BACKEND=docker

Example of a container service with Docker labels:

com.datadoghq.ad.check_names=["nginx"]
com.datadoghq.ad.init_configs=[{}]
com.datadoghq.ad.logs=[{"source": "nginx", "service": "my-service"}]

@btsuhako were you able to get ECS log collection working with any version of datadog-agent >=6.5.2?

I just started from scratch with 6.6.0 and was pulling my hair out over why I couldn't get any logs shipped to Datadog until I found the notes in this issue. I downgraded to 6.4.2 and 💥 I suddenly had logs flowing to Datadog!

@NBParis - I’m not clear if this is a bug in the datadog-agent or just a (shared) misunderstanding of the current documentation.

Also, I see this in the 6.6.0 release notes:

Fix bug that occurs when checks labels/annotation are misconfigured and would prevent the logs of the container to be tailed

Is that bug fix related to this issue? Thanks!

Hello @jalessio ,

So it is indeed partially linked. The situation we had was that badly formatted annotations/labels for metrics or logs were breaking the entire collection of data for that container. Now, logs and metrics annotations can fail independently without blocking the other data type's collection.
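As a hedged illustration (label values borrowed from the example earlier in this thread), a container carrying the check labels but no com.datadoghq.ad.instances label would hit the "missing instances key" error seen in the logs above:

com.datadoghq.ad.check_names=["nginx"]
com.datadoghq.ad.init_configs=[{}]
com.datadoghq.ad.logs=[{"source": "nginx", "service": "my-service"}]

Before the fix, that metrics-template error also blocked tailing of the container's logs; afterwards, the com.datadoghq.ad.logs label is processed independently.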

The next agent version 6.8 should solve all the issues raised in this thread.

Oh, sorry, it looks like the issue is solved… there was probably a wrong conf in the nginx image. Thank you for the help!

@btsuhako many thanks for the quick reply! I’ll try this out today.

@jalessio we’re successfully running the 6.7.0 agent on AWS Linux 2 hosts. Logging and APM work as expected.

Datadog Agent task definition -> https://gist.github.com/btsuhako/097a2e0d7932cca588cfcdcdf36dbb88

Sample ECS service task definition -> https://gist.github.com/btsuhako/33c1d3d6a2bbee52afa4cf92d3df1f6b

We build our Docker images without any labels, and apply the needed ones at runtime with the task definition. Note that we use only 1 label for our NodeJS application, and 4 labels for the nginx reverse proxy sidecar.

From @NBParis's comment (https://github.com/DataDog/datadog-agent/issues/2383#issuecomment-428104773), it seems like you can use 1 label (com.datadoghq.ad.logs) or all 4 (com.datadoghq.ad.instances, com.datadoghq.ad.check_names, com.datadoghq.ad.init_configs, com.datadoghq.ad.logs), but anything in between may not function properly.
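For reference, a rough sketch of the full four-label variant for an nginx container; the nginx_status_url instance value is an illustrative assumption, not taken from this thread:

com.datadoghq.ad.check_names=["nginx"]
com.datadoghq.ad.init_configs=[{}]
com.datadoghq.ad.instances=[{"nginx_status_url": "http://%%host%%/nginx_status"}]
com.datadoghq.ad.logs=[{"source": "nginx", "service": "my-service"}]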

Hello @undiabler, @btsuhako and @johanvereshchaga.

We have identified a potential issue which might explain the behaviour you observed. Is the Datadog Agent running on the host?

If yes, would you mind adding the following lines to datadog.yaml and letting us know whether, after restarting the agent, it then works fine:

listeners:
  - name: docker

config_providers:
  - name: docker
    polling: true

Why do we need this?

As explained in the previous post, log collection was merged into the Autodiscovery feature of the Agent, which means Autodiscovery now needs to be enabled. The lines above enable it in the agent.

This is not necessary when running the containerised version of the agent as it is enabled automatically.
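Putting that together with the logs_enabled flag from the reporter's datadog.yaml above, the host configuration would look roughly like this:

logs_enabled: true

listeners:
  - name: docker

config_providers:
  - name: docker
    polling: true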

Hello everyone,

Thanks for reporting this issue. We will definitely have a look, replicate and fix this behaviour.

There might, however, be a workaround until this is fixed. Until Agent version 6.5, it was required on Kubernetes to use configuration files to filter containers by name or image.

As it is now possible to use the Autodiscovery feature with the agent, you can do the same configuration directly in container labels or pod annotations.

Examples: https://docs.datadoghq.com/logs/log_collection/docker/?tab=nginxdockerfile#examples

This means that you now have the ability to easily:

  • Collect all logs with the DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL environment variable.
  • Override the service and source values thanks to labels or pod annotations (see the annotation sketch after this list).
  • Choose to collect only specific logs by removing the DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL variable and setting the labels or pod annotations on the container that should be collected.
  • Include or exclude containers thanks to the DD_AC_INCLUDE and DD_AC_EXCLUDE variables (example).
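As a rough sketch of the pod-annotation variant for overriding source and service (pod and container names here are illustrative; the annotation key must match the container name):

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  annotations:
    ad.datadoghq.com/nginx.logs: '[{"source": "nginx", "service": "my-service"}]'
spec:
  containers:
    - name: nginx
      image: nginx:latest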

That said, the previous configuration should still work, so this definitely needs to be fixed. I just wanted to share the new behaviour, which we believe is much more dynamic and flexible.

Sorry for the trouble caused and once again thanks for reporting it.