fluent-bit: out_cloudwatch: Segmentation fault

Bug Report

Describe the bug

fluent-bit immediately crashes with a segmentation fault when the cloudwatch_logs output is used on ARM64.

To Reproduce

Configure fluent-bit to output to cloudwatch_logs.

[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    DB                /var/log/flb_kube.db
    Parser            docker
    Docker_Mode       On
    Mem_Buf_Limit     5MB
    Skip_Long_Lines   On
    Refresh_Interval  10

[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc.cluster.local:443
    Merge_Log           On
    Merge_Log_Key       data
    Keep_Log            On
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On
    Buffer_Size         32k

[OUTPUT]
    Name loki
    Match *
    host loki-gateway.monitoring.svc.cluster.local
    port 80
    auto_kubernetes_labels false
    labels job=fluent-bit
    label_map_path /etc/labelmap.json
    remove_keys kubernetes

[OUTPUT]
    Name                  cloudwatch_logs
    Match                 *
    region                eu-west-1
    log_group_name        /aws/eks/fluentbit-cloudwatch/logs
    log_stream_prefix     fluentbit-
    log_retention_days    30
    auto_create_group     true

The following error is logged on startup and fluent-bit crashes:

[2022/12/23 14:06:57] [ info] [fluent bit] version=2.0.8, commit=9444fdc5ee, pid=1                                                                                                            
[2022/12/23 14:06:57] [ info] [storage] ver=1.4.0, type=memory, sync=normal, checksum=off, max_chunks_up=128                                                                                  
[2022/12/23 14:06:57] [ info] [cmetrics] version=0.5.8                                                                                                                                        
[2022/12/23 14:06:57] [ info] [ctraces ] version=0.2.7                                                                                                                                        
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] initializing                                                                                                                                
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)                                                                                                     
[2022/12/23 14:06:57] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc.cluster.local port=443                                                                     
[2022/12/23 14:06:57] [ info] [filter:kubernetes:kubernetes.0]  token updated                                                                                                                 
[2022/12/23 14:06:57] [ info] [filter:kubernetes:kubernetes.0] local POD info OK                                                                                                              
[2022/12/23 14:06:57] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...                                                                                        
[2022/12/23 14:06:57] [ info] [filter:kubernetes:kubernetes.0] connectivity OK                                                                                                                
[2022/12/23 14:06:57] [ info] [output:loki:loki.0] configured, hostname=loki-gateway.monitoring.svc.cluster.local:80                                                                          
[2022/12/23 14:06:57] [ info] [output:cloudwatch_logs:cloudwatch_logs.1] worker #0 started                                                                                                    
[2022/12/23 14:06:57] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020                                                                                                                
[2022/12/23 14:06:57] [ info] [sp] stream processor started
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=528997 watch_fd=2 name=/var/log/containers/aws-load-balancer-controller-5bfbd747bd-59v4h_kube-system_aws-load-balanc
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=134898 watch_fd=3 name=/var/log/containers/aws-load-balancer-controller-5bfbd747bd-vwxlb_kube-system_aws-load-balanc
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=134905 watch_fd=4 name=/var/log/containers/aws-load-balancer-controller-5bfbd747bd-vwxlb_kube-system_aws-load-balanc
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=266650 watch_fd=5 name=/var/log/containers/aws-node-bndp7_kube-system_aws-node-25e1ce4793370ef1b2386200c27bc6c0f2007
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=266644 watch_fd=6 name=/var/log/containers/aws-node-bndp7_kube-system_aws-vpc-cni-init-1798466d7955c1f09fa5be21ef65c
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=134906 watch_fd=7 name=/var/log/containers/cert-manager-cainjector-86f7f4749-kfndw_cert-manager_cert-manager-cainjec
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=134907 watch_fd=8 name=/var/log/containers/cert-manager-cainjector-86f7f4749-kfndw_cert-manager_cert-manager-cainjec
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=269677 watch_fd=9 name=/var/log/containers/cert-manager-webhook-66c85f8577-lhbfz_cert-manager_cert-manager-webhook-9
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=266446 watch_fd=10 name=/var/log/containers/cluster-proportional-autoscaler-coredns-6ccfb4d9b5-8v892_kube-system_clu
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=266427 watch_fd=11 name=/var/log/containers/coredns-68b4d7cb47-sk5lc_kube-system_coredns-65f7eea6cd0dd159aa4bcba818a
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=529055 watch_fd=12 name=/var/log/containers/ebs-csi-controller-ddd4f6984-225ng_kube-system_csi-attacher-5c87844421e8
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=529031 watch_fd=13 name=/var/log/containers/ebs-csi-controller-ddd4f6984-225ng_kube-system_csi-provisioner-ee9da4c02
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=529104 watch_fd=14 name=/var/log/containers/ebs-csi-controller-ddd4f6984-225ng_kube-system_csi-resizer-b8d5c1b1e0569
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=529078 watch_fd=15 name=/var/log/containers/ebs-csi-controller-ddd4f6984-225ng_kube-system_csi-snapshotter-280f7d129
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=528990 watch_fd=16 name=/var/log/containers/ebs-csi-controller-ddd4f6984-225ng_kube-system_ebs-plugin-09adedebe40a79
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=530317 watch_fd=17 name=/var/log/containers/ebs-csi-controller-ddd4f6984-225ng_kube-system_liveness-probe-5631541006
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=528598 watch_fd=18 name=/var/log/containers/ebs-csi-node-bzvwg_kube-system_ebs-plugin-5be936e176dc41bee8d7cfb3d9a358
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=528843 watch_fd=19 name=/var/log/containers/ebs-csi-node-bzvwg_kube-system_liveness-probe-ba8c6f18f5865fcaf56b061e93
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=528725 watch_fd=20 name=/var/log/containers/ebs-csi-node-bzvwg_kube-system_node-driver-registrar-7ebb2d392e1c655e89e
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=528542 watch_fd=22 name=/var/log/containers/fluent-bit-gkclz_monitoring_fluent-bit-b40b1d6a925d1bbfa8aae336fa00b37c8
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=7765 watch_fd=32 name=/var/log/containers/kube-proxy-qwmsq_kube-system_kube-proxy-8e9f8234db195425e6db081745b8fe6d57
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=528530 watch_fd=33 name=/var/log/containers/fluent-bit-gkclz_monitoring_fluent-bit-5df98c2f5e66a525ba1cd91b5ddcb6e2c
[2022/12/23 14:06:57] [ info] [output:cloudwatch_logs:cloudwatch_logs.1] Creating log stream fluentbit-kube.var.log.containers.fluent-bit-gkclz_monitoring_fluent-bit-5df98c2f5e66a525ba1cd91b
[2022/12/23 14:06:57] [ info] [output:cloudwatch_logs:cloudwatch_logs.1] Created log stream fluentbit-kube.var.log.containers.fluent-bit-gkclz_monitoring_fluent-bit-5df98c2f5e66a525ba1cd91b5
[2022/12/23 14:06:57] [engine] caught signal (SIGSEGV)                                                                                                                                        
#0  0xaaaab52df2b0      in  template_execute() at lib/msgpack-c/include/msgpack/unpack_template.h:172                                                                                         
#1  0xaaaab52e0f6b      in  msgpack_unpack_next() at lib/msgpack-c/src/unpack.c:677                                                                                                           
#2  0xaaaab4c7ab0b      in  process_and_send() at plugins/out_cloudwatch_logs/cloudwatch_api.c:859                                                                                            
#3  0xaaaab4c741bf      in  cb_cloudwatch_flush() at plugins/out_cloudwatch_logs/cloudwatch_logs.c:417                                                                                        
#4  0xaaaab48a860f      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:527                                                                                                      
#5  0xaaaab52fc0ef      in  co_switch() at lib/monkey/deps/flb_libco/aarch64.c:133                                                                                                            
#6  0xffffffffffffffff  in  ???() at ???:0                                                                                                                                                    
Stream closed EOF for monitoring/fluent-bit-gkclz (fluent-bit)
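
For context on the crashing frame, below is a minimal sketch, in plain C, of the msgpack_unpack_next() iteration pattern named in the trace. It is illustrative only and is not fluent-bit's actual process_and_send(); the record handling is a placeholder. The point is that msgpack_unpack_next() parses straight out of the caller-supplied buffer, so if the data/length passed in no longer describe valid memory by the time the flush runs, the dereference inside template_execute() faults exactly as shown above.

/* Minimal illustrative sketch, not fluent-bit's actual code. */
#include <msgpack.h>
#include <stddef.h>

static void iterate_chunk(const char *data, size_t bytes)
{
    msgpack_unpacked result;
    size_t off = 0;

    msgpack_unpacked_init(&result);
    /* msgpack_unpack_next() parses the next object directly from `data`;
     * per the trace, template_execute() is the inner routine it drives.
     * If `data` has been freed or `bytes` overstates its size, this loop
     * is where a SIGSEGV like the one above would surface. */
    while (msgpack_unpack_next(&result, data, bytes, &off) == MSGPACK_UNPACK_SUCCESS) {
        msgpack_object record = result.data; /* each record is typically a [timestamp, map] pair */
        (void) record;                       /* placeholder: build the PutLogEvents payload here */
    }
    msgpack_unpacked_destroy(&result);
}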

Expected behavior

fluent-bit should start successfully and begin logging to CloudWatch.

Your Environment

  • Version used: 2.0.8
  • Configuration: See above.
  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes v1.24.7-eks-fb459a0
  • Server type and version: arm64
  • Operating System and version: Bottlerocket OS 1.11.1 (aws-k8s-1.24)
  • Filters and plugins: kubernetes, cloudwatch_logs

Additional context

There is another issue with a similar error. My expectation was that this would be resolved by version 2.0.8, which includes that fix, but it seems that is not the case. See https://github.com/fluent/fluent-bit/issues/6451

Most upvoted comments

@cdancy actually, the issue seems to be caused simply by having any two outputs match the same tag. We’ve also discovered another issue on 1.9 (not sure if it applies to 2.0) which is much more likely to occur if you have net.keepalive On, the default.

The stack trace in this issue matches the “two outputs match same tag” issue we are still investigating.

Sorry, that’s all I can say right now. Will post more when I have clearer info.
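
Based on the two triggers described above, a couple of diagnostic experiments (not confirmed fixes) that may help narrow this down: temporarily run with only one output matching the tag to test the overlapping-match trigger, and/or disable keepalive on the cloudwatch_logs output, since the keepalive interaction was reported on 1.9 and may not apply to 2.0. A hedged sketch of the latter, reusing the values from the report above; net.keepalive is the generic output networking property, and whether toggling it changes anything here is exactly what the experiment would test:

[OUTPUT]
    Name                  cloudwatch_logs
    Match                 *
    region                eu-west-1
    log_group_name        /aws/eks/fluentbit-cloudwatch/logs
    log_stream_prefix     fluentbit-
    log_retention_days    30
    auto_create_group     true
    # diagnostic only: the keepalive interaction was observed on 1.9
    net.keepalive         Off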

I want to note here that we think we’ve found a potential issue when you have multiple cloudwatch_logs outputs matching the same tag. Not sure which versions this impacts yet.

@cdancy the AWS distro is focused on AWS customers and thus, yes, is always the most up-to-date distro in terms of fixes for AWS customers.

“We don’t seem to need all the bells and whistles the aws-for-fluent-bit project uses”

We’re distributing the same code; the only thing we add is the old AWS Go plugins. What do you mean? https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#aws-go-plugins-vs-aws-core-c-plugins

@cdancy I have a pre-release build of S3 concurrency fix here: https://github.com/aws/aws-for-fluent-bit/issues/495#issuecomment-1356478537

Keep checking the AWS distro; we will release it very soon.

If you can repro it with an AWS distro version and open an issue at our repo, that will also help with tracking.

@cdancy thanks for letting me know.

I see this:

[2023/01/27 20:20:31] [debug] [input chunk] tail.0 is paused, cannot append records
[2023/01/27 20:20:31] [debug] [input chunk] tail.0 is paused, cannot append records
#0  0x55e4e178aaf5      in  template_execute() at lib/msgpack-c/include/msgpack/unpack_template.h:172
#1  0x55e4e178c59b      in  msgpack_unpack_next() at lib/msgpack-c/src/unpack.c:677
#2  0x55e4e112ff9e      in  process_and_send() at plugins/out_cloudwatch_logs/cloudwatch_api.c:859
#3  0x55e4e1129717      in  cb_cloudwatch_flush() at plugins/out_cloudwatch_logs/cloudwatch_logs.c:417
#4  0x55e4e0d6b267      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:527
#5  0x55e4e17a7266      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117

Similar to the trace already in this issue. I am wondering if it’s related to the S3 concurrency issues we recently discovered: https://github.com/fluent/fluent-bit/pull/6573

This will be released in AWS4FB 2.31.0 very soon; not sure when it will get into this distro: https://github.com/aws/aws-for-fluent-bit/releases