aws-for-fluent-bit: Exit code 255 on 2.24.0 release

Describe the question/issue

The ECS task never makes it past the PENDING stage; the Fluent Bit container exits with status code 255.

This is only happening with 2.24.0, not 2.23.4.

Configuration

{
  "ipcMode": null,
  "executionRoleArn": "arn:aws:iam::xxxx:role/Execution-Role",
  "containerDefinitions": [
    {
      // Other container definitions here
    },
    {
      "dnsSearchDomains": null,
      "environmentFiles": null,
      "logConfiguration": null,
      "entryPoint": null,
      "portMappings": [],
      "command": null,
      "linuxParameters": null,
      "cpu": 0,
      "environment": [],
      "resourceRequirements": null,
      "ulimits": null,
      "dnsServers": null,
      "mountPoints": [],
      "workingDirectory": null,
      "secrets": null,
      "dockerSecurityOptions": null,
      "memory": null,
      "memoryReservation": null,
      "volumesFrom": [],
      "stopTimeout": null,
      "image": "906394416424.dkr.ecr.us-east-1.amazonaws.com/aws-for-fluent-bit:latest",
      "startTimeout": null,
      "firelensConfiguration": {
        "type": "fluentbit",
        "options": {
          "config-file-type": "file",
          "enable-ecs-log-metadata": "true",
          "config-file-value": "/fluent-bit/configs/parse-json.conf"
        }
      },
      "dependsOn": null,
      "disableNetworking": null,
      "interactive": null,
      "healthCheck": null,
      "essential": true,
      "links": null,
      "hostname": null,
      "extraHosts": null,
      "pseudoTerminal": null,
      "user": "0",
      "readonlyRootFilesystem": null,
      "dockerLabels": null,
      "systemControls": null,
      "privileged": null,
      "name": "log_router"
    }
  ],
  "placementConstraints": [],
  "memory": "2048",
  "taskRoleArn": null,
  "compatibilities": [
    "EC2",
    "FARGATE"
  ],
  "taskDefinitionArn": "arn:aws:ecs:us-east-1:xxxx:task-definition/xxxx:123",
  "family": "xxxx",
  "requiresAttributes": [
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.ecr-auth"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.firelens.fluentbit"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.firelens.options.config.file"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.secrets.asm.environment-variables"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.logging-driver.awsfirelens"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.execution-role-ecr-pull"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.secrets.asm.bootstrap.log-driver"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.task-eni"
    }
  ],
  "pidMode": null,
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "networkMode": "awsvpc",
  "runtimePlatform": null,
  "cpu": "1024",
  "revision": 123,
  "status": "INACTIVE",
  "inferenceAccelerators": null,
  "proxyConfiguration": null,
  "volumes": [],
  "statusString": "(INACTIVE)"
}

Fluent Bit Log Output

I was unable to obtain logs from the container, as it crashed.

Fluent Bit Version Info

This has been an issue on latest and 2.24.0, but was not an issue with stable or 2.23.4.

Cluster Details

ECS Fargate, VPC endpoints, sidecar deployment.

Private network with API gateway to the outside world.

Application Details

At startup, the service produces ~10 logs in the first second or two.

Steps to reproduce issue

  • Deploy 2.24.0
  • Observe that the task is stuck in a pending state, with the Fluent Bit container exiting with code 255

Rolling back to 2.23.4 deploys successfully.

Related Issues

None that I could find

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 4
  • Comments: 29 (11 by maintainers)

Most upvoted comments

Below are the valid keys for the datadog output:

Host, TLS, compress, apikey, Proxy, provider, json_date_key, include_tag_key, tag_key, dd_service, dd_source, dd_tags, dd_message_key. Any other key results in an error. If you need to add env tags, make them part of dd_tags.
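
As a rough illustration of that key list, a datadog output section might look like the sketch below. Only the key names come from the list; the host, the API key placeholder, and the service, source, and tag values are hypothetical and not taken from this issue. Name and Match are the generic section keys every output takes.

```ini
# Sketch only: every value below is a placeholder, not taken from this issue.
[OUTPUT]
    Name        datadog
    Match       *
    Host        http-intake.logs.datadoghq.com
    TLS         on
    compress    gzip
    apikey      <your-datadog-api-key>
    dd_service  my-service
    dd_source   my-source
    # Environment tags go inside dd_tags rather than in a separate key.
    dd_tags     env:production,team:platform
```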

We hit this as well; switching from the latest tag to the stable tag got it working again.

@bimp It seems the answer is the same as for the systemd plugin (see farther back in the comment stream on this issue): config validation was missing previously and was only just added. https://github.com/fluent/fluent-bit/commit/fbe829eff4c348fd297e3c7be04103d471224dba

@PettitWesley further log hunting revealed that you’re probably right:

[2022/05/11 17:00:31] [ info] [fluent bit] version=1.9.3, commit=9eb4996b7d, pid=1
[2022/05/11 17:00:31] [ info] [storage] version=1.2.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/05/11 17:00:31] [ info] [cmetrics] version=0.3.1
[2022/05/11 17:00:31] [error] [lib] backend failed
[2022/05/11 17:00:31] [ info] [input:forward:forward.0] listening on unix:///var/run/fluent.sock
[2022/05/11 17:00:31] [ info] [input:forward:forward.1] listening on 127.0.0.1:24224
[2022/05/11 17:00:31] [ info] [input:tcp:tcp.2] listening on 127.0.0.1:8877
[2022/05/11 17:00:31] [error] [config] record_modifier: unknown configuration property 'Reserve_Data'. The following properties are allowed: record, remove_key, allowlist_key, and whitelist_key.
[2022/05/11 17:00:31] [ help] try the command: /fluent-bit/bin/fluent-bit -F record_modifier -h
[2022/05/11 17:00:31] [ info] [input] pausing forward.0
AWS for Fluent Bit Container Image Version 2.24.0

The question is, why did this happen now? I’ve had this incorrect option forever. It completely blocked the Fargate service from running, so I’m curious why it is now failing so catastrophically.
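
Going by the error message in the log above, a record_modifier filter limited to the listed properties could look like the following sketch. The record name, its value, and the removed key are hypothetical. Reserve_Data is an option of the parser filter, so if the intent was to keep the remaining fields while parsing, it probably belongs there instead; that is a guess about the original config, which is not shown in this thread.

```ini
# Sketch only: the record name/value and the removed key are hypothetical.
[FILTER]
    Name        record_modifier
    Match       *
    # Append a key/value pair to every record.
    Record      service my-service
    # Drop a key from every record.
    Remove_key  unwanted_key
    # Reserve_Data is rejected here by Fluent Bit 1.9.3 (see the log above).
```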

@albertschwarzkopf This one is fun: it seems that in previous versions the config for the systemd input was not actually validated, so it was possible to set options that don’t exist. https://github.com/fluent/fluent-bit/commit/773581f8e96b10c7b9fa30224d262382b763a3c7

In previous versions I’m able to run that input with all sorts of random fake keys added.

https://docs.fluentbit.io/manual/pipeline/inputs/systemd
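
For comparison, a systemd input that sticks to documented options might look like this sketch. The tag and the unit filter are hypothetical values, and the linked documentation remains the authority on which keys are accepted.

```ini
# Sketch only: the tag and unit filter are hypothetical values.
[INPUT]
    Name            systemd
    Tag             host.*
    Systemd_Filter  _SYSTEMD_UNIT=docker.service
    Read_From_Tail  On
```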

I think you need to use the filter parser with your parser to parse these logs: https://docs.fluentbit.io/manual/pipeline/filters/parser
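
A minimal sketch of that suggestion follows. It assumes the raw line sits under the log key and that a parser named json is defined in the parsers file loaded by the [SERVICE] section; the path, key name, and parser name are assumptions and should be swapped for your own.

```ini
[SERVICE]
    # Assumed path to a parsers file that defines the parser used below.
    Parsers_File  /fluent-bit/parsers/parsers.conf

[FILTER]
    Name          parser
    Match         *
    # Key holding the raw log line to parse (assumed to be "log").
    Key_Name      log
    # Name of the parser to apply (assumed; use your own parser's name).
    Parser        json
    # Keep the other fields of the record alongside the parsed ones.
    Reserve_Data  On
```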

I can confirm this is happening as well. It caused a lot of confusion and crashes across all our services yesterday 😅