splunk-connect-for-kubernetes: Exception stack trace logs getting split into multiple lines (events) when seen on splunk

What happened: We have been migrating applications from EC2-based deployments to Kubernetes-based deployments, and we are seeing logs get split when we view them in Splunk. For example, a Java exception stack trace shows up as multiple events in Splunk. This never happened previously when the application was hosted on EC2 (not containers).

What you expected to happen: The entire exception stack trace should arrive as a single event in Splunk.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?: This is on EKS with Docker.

Environment:

  • Kubernetes version (use kubectl version): we are on k8s 1.14
  • Ruby version (use ruby --version):
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
  • Splunk version:
  • Others:

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 20 (2 by maintainers)

Most upvoted comments

Thanks @matthewmodestino. I think we have found our issue now. We also picked up on the chat in Slack regarding the timestamp parsing:

Our time_format was slightly different from SCK's, though; we were using time_format %Y-%m-%dT%H:%M:%S.%NZ. We switched to keeping the time field, removed time_format, and added a filter using the filter_with_time method that essentially does:

time = Fluent::EventTime.from_time(Time.iso8601(record['time']))
record.delete('time')

This reduced CPU utilization quite a bit and improved our throughput.
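For context, here is a minimal sketch of what such a custom filter plugin could look like. The plugin name parse_time_field and the assumption that the record carries an ISO 8601 string under the time key are illustrative only; this is not SCK code.

require 'fluent/plugin/filter'
require 'time'

module Fluent
  module Plugin
    class ParseTimeFieldFilter < Filter
      # Hypothetical plugin name, used only for illustration.
      Fluent::Plugin.register_filter('parse_time_field', self)

      # filter_with_time lets the plugin replace the event time as well as the record.
      def filter_with_time(tag, time, record)
        if record['time']
          time = Fluent::EventTime.from_time(Time.iso8601(record['time']))
          record.delete('time')
        end
        [time, record]
      end
    end
  end
end

Per the comment above, moving the timestamp parsing into a filter like this (instead of time_format in the tail parser) reduced CPU utilization and improved throughput.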

@rockb1017 have you seen that thread where the user was talking about the differences in performance when doing timestamp parsing?

Right. I’m trying to find where I saw this; I believe it was a previous issue here… but it boils down to this, if I am remembering correctly:

https://github.com/splunk/splunk-connect-for-kubernetes/blob/fbcd160064f655f10372b22bfedf8ac3ae83464f/helm-chart/splunk-connect-for-kubernetes/charts/splunk-kubernetes-logging/templates/configMap.yaml#L89-L100

 {{- if eq .Values.containers.logFormatType "cri" }}
        @type regexp
        expression /^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$/
        time_format  {{ .Values.containers.logFormat | default "%Y-%m-%dT%H:%M:%S.%N%:z" }}
      {{- else if eq .Values.containers.logFormatType "json" }}
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      {{- end }}
        time_key time
        time_type string
        localtime false
      </parse>

I recall a customer updating the regex to extract the flag and then implementing something like this by updating our charts:

<source>
  @type tail
  path /var/log/containers/*.log
  <parse>
    @type regexp
    expression /^(?<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z) (?<output>\w+) (?<partial_flag>[FP]) (?<message>.+)$/
  </parse>
  tag k8s
  @label @CONCAT
</source>

<label @CONCAT>
  <filter k8s>
    @type concat
    key message
    partial_key partial_flag
    partial_value P
  </filter>
  <match k8s>
    @type relabel
    @label @OUTPUT
  </match>
</label>

<label @OUTPUT>
  <match **>
    @type stdout
  </match>
</label>

However, I have seen chatter in the user-groups Slack channel about performance when time is extracted at various places… one customer said they got big throughput gains depending on which stage in fluentd was doing the timestamp parsing.

oc may add some magic, as it is a binary that reads and parses the cri-o log for humans; please grab the raw log from the cri-o log file. AFAIK, oc doesn't show the cri-o partial flag.

Beyond cri-o, I don't see other limits in fluentd that would snip it… I don't believe fluentd-hec will slice up payloads.
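For reference, the third field of a raw CRI log line is that partial flag (P for a runtime-split partial line, F for a full line). Here is a small, illustrative Ruby check of the regex from the chart snippet above; the sample lines are made up but follow the documented CRI log format "<RFC3339 timestamp> <stdout|stderr> <P|F> <message>".

pattern = /^(?<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z) (?<output>\w+) (?<partial_flag>[FP]) (?<message>.+)$/

lines = [
  '2021-01-01T12:00:00.000000000Z stdout P first half of a long line the runtime split...',
  '2021-01-01T12:00:00.000000100Z stdout F ...second half of that line',
  '2021-01-01T12:00:00.000000200Z stderr F a complete single-line event',
]

lines.each do |line|
  if (m = pattern.match(line))
    # The concat filter above keys off partial_flag == "P" to stitch partial lines back together.
    puts "flag=#{m[:partial_flag]} stream=#{m[:output]} message=#{m[:message]}"
  end
end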