kubernetes: fluentd-gcp crashing because of `JournalError: Bad message`

This happened in a 1.6 test: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gke-serial-release-1-6/929

fluentd crashed/restarted ~40 times during the test because it could not handle the bad message in journal.

2017-11-14 15:40:52 +0000 [error]: unexpected error error_class=Systemd::JournalError error=#<Systemd::JournalError: Bad message>
  2017-11-14 15:40:52 +0000 [error]: /var/lib/gems/2.1.0/gems/systemd-journal-1.2.3/lib/systemd/journal.rb:284:in `enumerate_helper'
  2017-11-14 15:40:52 +0000 [error]: /var/lib/gems/2.1.0/gems/systemd-journal-1.2.3/lib/systemd/journal.rb:106:in `current_entry'
  2017-11-14 15:40:52 +0000 [error]: /var/lib/gems/2.1.0/gems/fluent-plugin-systemd-0.0.8/lib/fluent/plugin/in_systemd.rb:88:in `watch'
  2017-11-14 15:40:52 +0000 [error]: /var/lib/gems/2.1.0/gems/fluent-plugin-systemd-0.0.8/lib/fluent/plugin/in_systemd.rb:70:in `run'
2017-11-14 15:40:52 +0000 [info]: shutting down fluentd

This issue fixed in fluentd-0.14.x according to https://github.com/reevoo/fluent-plugin-systemd/issues/16 Not sure if the latest fluentd-gcp image contains the fix or not.

/cc @crassirostris

About this issue

  • Original URL
  • State: open
  • Created 7 years ago
  • Comments: 15 (8 by maintainers)

Most upvoted comments

Had this problem today on 3 of my 6 nodes.

Node version 1.8.4-gke.0 fluentd: gcr.io/google-containers/fluentd-gcp:2.0.9

2017-12-12 18:43:50 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-12-12 18:43:50 +0000 [warn]: 'block' action stops input process until the buffer full is resolved. Check your pipeline this action is fit or not
2017-12-12 18:43:51 +0000 [error]: unexpected error error_class=Systemd::JournalError error=#<Systemd::JournalError: Bad message>
  2017-12-12 18:43:51 +0000 [error]: /var/lib/gems/2.1.0/gems/systemd-journal-1.3.0/lib/systemd/journal.rb:300:in `enumerate_helper'
  2017-12-12 18:43:51 +0000 [error]: /var/lib/gems/2.1.0/gems/systemd-journal-1.3.0/lib/systemd/journal.rb:106:in `current_entry'
  2017-12-12 18:43:51 +0000 [error]: /var/lib/gems/2.1.0/gems/fluent-plugin-systemd-0.0.8/lib/fluent/plugin/in_systemd.rb:88:in `watch'
  2017-12-12 18:43:51 +0000 [error]: /var/lib/gems/2.1.0/gems/fluent-plugin-systemd-0.0.8/lib/fluent/plugin/in_systemd.rb:70:in `run'
2017-12-12 18:43:51 +0000 [warn]: unexpected error while shutting down input plugin plugin=Fluent::SystemdInput plugin_id="object:3f9d8556d5e4" error_class=Systemd::JournalError error=#<Systemd::JournalError: Bad message>
  2017-12-12 18:43:51 +0000 [warn]: /var/lib/gems/2.1.0/gems/systemd-journal-1.3.0/lib/systemd/journal.rb:300:in `enumerate_helper'
  2017-12-12 18:43:51 +0000 [warn]: /var/lib/gems/2.1.0/gems/systemd-journal-1.3.0/lib/systemd/journal.rb:106:in `current_entry'
  2017-12-12 18:43:51 +0000 [warn]: /var/lib/gems/2.1.0/gems/fluent-plugin-systemd-0.0.8/lib/fluent/plugin/in_systemd.rb:88:in `watch'
  2017-12-12 18:43:51 +0000 [warn]: /var/lib/gems/2.1.0/gems/fluent-plugin-systemd-0.0.8/lib/fluent/plugin/in_systemd.rb:70:in `run'

I did fix the problem by deleting the bad messages. I did the following on each node that had errors.

First I identifed the bad messages

#sudo journalctl --verify
63ffea0: Unused data (entry_offset==0)
PASS: /var/log/journal/463a3098948b621616861afda4dce21d/system@220d223d0fbf42b0a0664308443f876a-000000000023161c-00056024bb72f6b1.journal
63ffe88: Unused data (entry_offset==0)
PASS: /var/log/journal/463a3098948b621616861afda4dce21d/system@220d223d0fbf42b0a0664308443f876a-0000000000249cdb-000560260cf13d71.journal
63ffe40: Unused data (entry_offset==0)
PASS: /var/log/journal/463a3098948b621616861afda4dce21d/system@220d223d0fbf42b0a0664308443f876a-00000000001e81c6-00056020c60abe3b.journal
63ffe80: Unused data (entry_offset==0)
PASS: /var/log/journal/463a3098948b621616861afda4dce21d/system@220d223d0fbf42b0a0664308443f876a-00000000001b743a-0005601e227b6eef.journal
PASS: /var/log/journal/463a3098948b621616861afda4dce21d/system@220d223d0fbf42b0a0664308443f876a-00000000001cfb00-0005601f74498661.journal
PASS: /var/log/journal/463a3098948b621616861afda4dce21d/system@57f462172e3d43d9a3e26cbdf0a6b61f-0000000000000001-0005602822055099.journal
30f3b40: Invalid data object at hash entry 12 of 182044
File corruption detected at /var/log/journal/463a3098948b621616861afda4dce21d/system.journal:30f39d0 (of 58720256 bytes, 87%).
FAIL: /var/log/journal/463a3098948b621616861afda4dce21d/system.journal (Bad message)
PASS: /var/log/journal/463a3098948b621616861afda4dce21d/system@220d223d0fbf42b0a0664308443f876a-0000000000218f59-00056023699fa9a9.journal
319df48: Invalid entry item (8/19 offset: 000000
319df48: Invalid object contents: Bad message
File corruption detected at /var/log/journal/463a3098948b621616861afda4dce21d/system@00056028225162eb-74f90f9f6ab6117d.journal~:319df48 (of 58720256 bytes, 88%).
FAIL: /var/log/journal/463a3098948b621616861afda4dce21d/system@00056028225162eb-74f90f9f6ab6117d.journal~ (Bad message)
63ffeb8: Unused data (entry_offset==0)
PASS: /var/log/journal/463a3098948b621616861afda4dce21d/system@220d223d0fbf42b0a0664308443f876a-000000000019ed73-0005601cd0f5add8.journal
PASS: /var/log/journal/463a3098948b621616861afda4dce21d/system@220d223d0fbf42b0a0664308443f876a-0000000000200891-0005602217c55738.journal

Then i moved the offending files (I tend to mv before delete)

sudo mv /var/log/journal/463a3098948b621616861afda4dce21d/system.journal /var/log/journal/463a3098948b621616861afda4dce21d/system.journal.bak

sudo mv /var/log/journal/463a3098948b621616861afda4dce21d/system@00056028225162eb-74f90f9f6ab6117d.journal~ /var/log/journal/463a3098948b621616861afda4dce21d/system@00056028225162eb-74f90f9f6ab6117d.journal~.bak

And then finally I restarted the systemd-journald.service sudo systemctl restart systemd-journald

Seeing the same thing

Node version 1.8.6-gke.0 fluentd image: gcr.io/google-containers/fluentd-gcp:2.0.9