logspout: logspout fails to reconnect after connection reset by peer.

I have 5 instances of logspout running in marathon talking to loggly. all was well until yesterday when all 5 failed with errors like this:

#   ADAPTER     ADDRESS                 CONTAINERS      SOURCES OPTIONS
#   syslog+tcp  logs-01.loggly.com:514                          map[]
2016/07/09 03:39:42 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: connection reset by peer
2016/07/09 03:39:42 syslog: retrying tcp up to 11 times
2016/07/09 03:40:23 syslog: retry failed
2016/07/09 03:40:23 syslog: reconnecting up to 11 times
2016/07/10 23:33:50 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: connection timed out
2016/07/10 23:33:50 syslog: retrying tcp up to 11 times
2016/07/10 23:34:31 syslog: retry failed
2016/07/10 23:34:31 syslog: reconnecting up to 11 times
2016/07/11 02:04:47 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: connection timed out
2016/07/11 02:04:47 syslog: retrying tcp up to 11 times
2016/07/11 02:05:28 syslog: retry failed
2016/07/11 02:05:28 syslog: reconnecting up to 11 times
2016/07/11 02:32:28 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: connection timed out
2016/07/11 02:32:28 syslog: retrying tcp up to 11 times
2016/07/11 02:33:09 syslog: retry failed
2016/07/11 02:33:09 syslog: reconnecting up to 11 times
2016/07/12 02:57:56 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: connection timed out
2016/07/12 02:57:56 syslog: retrying tcp up to 11 times
2016/07/12 02:58:37 syslog: retry failed
2016/07/12 02:58:37 syslog: reconnecting up to 11 times
2016/07/13 23:51:10 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: broken pipe
2016/07/13 23:51:10 syslog: reconnecting up to 11 times
2016/07/18 11:11:16 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: broken pipe
2016/07/18 11:11:16 syslog: reconnecting up to 11 times
2016/07/22 05:51:14 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: broken pipe
2016/07/22 05:51:14 syslog: reconnecting up to 11 times
2016/07/27 08:00:46 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: connection reset by peer
2016/07/27 08:00:46 syslog: retrying tcp up to 11 times
2016/07/27 08:01:27 syslog: retry failed
2016/07/27 08:01:27 syslog: reconnecting up to 11 times
2016/07/27 17:40:52 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: connection timed out
2016/07/27 17:40:52 syslog: retrying tcp up to 11 times
2016/07/27 17:41:33 syslog: retry failed
2016/07/27 17:41:33 syslog: reconnecting up to 11 times
2016/07/28 00:56:24 syslog: write tcp xxx.xxx.xxx.xxx:34657->logglyip:514: write: connection timed out
2016/07/28 00:56:24 syslog: retrying tcp up to 11 times
2016/07/28 00:57:05 syslog: retry failed
2016/07/28 00:57:05 syslog: reconnecting up to 11 times

it looks like it retried with a backoff, but never succeeded in reconnecting, even though when I manually tried connecting it worked just fine.

restarting the dockers fixed the issue, so 2 things from this.

  1. something doesn’t work as expected with the retry mechanism.
  2. if there was a useful healthcheck endpoint in logspout, marathon could have shot/restarted the processes.

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Reactions: 5
  • Comments: 21 (3 by maintainers)

Commits related to this issue

Most upvoted comments

I think want we want is for the container to die when it can’t reconnect (ideally buffered but let’s not get ahead of ourselves) that way the scheduler can restart the container and/or notify someone when it’s not working

I did some more digging, and I found the following items that at least in my environment caused logspout to stop logging. This may or may not apply in your environment. We are working with Docker 1.8 (year old - apparently that is very old for docker 😉 ), so some of the points here, mostly regarding the JSON log driver might not be true in later versions.

  1. In https://github.com/gliderlabs/logspout/blob/master/router/pump.go#L351 there is a 1 second timeout for the send method. This means that if the reconnect takes more than one second, the log stream will be closed and from what I witnessed, it will stop the logs from that unlucky container.

  2. I observed a case where a container had a large log file (2GB - 4 days worth of logs), and when the tail option was set to all, and a since value was given, the docker daemon choked. I waited for the stream to start, and gave up after 15 minutes. Changing the initial tail from all to 0 solved this for me. On a stream disconnect (inactivity or other), I set the tail to 2000. Do note that it means that whenever logspout start it won’t digest any backlog, even if that option is set. It also means that on any reconnection of the docker stream (docker stream, not syslog), it will only look 2000 events back.

  3. Similar to previous issue, I encountered an issue with tail set to a large value (I tried 6000), the docker daemon might fail parsing the JSON logs. See https://github.com/docker/docker/issues/29511 - I was told to try with newer Docker version, but that is not possible for me ATM.

  4. I am no Go expert, but it seems to me that the retry mechanism in https://github.com/gliderlabs/logspout/blob/master/adapters/syslog/syslog.go#L104 will not try re- sending the log in the iteration if the error is not “temporary”. According to my observations, disconnect is not considered temporary. I fixed it by resending the log after reconnect() - https://github.com/imriz/logspout/blob/5db6a9fbc9d77e4635be62e86b7f98e1aa9d80bc/adapters/syslog/syslog.go#L143

Anyway, I fixed all these issue at my fork, and it is present in my docker image. I am running for about 24 hours without issues (before I had 3-4 incidents per day). If you use it, YMMV and please understand the consequences of my changes.

for the benefit of everyone affected by this (huge) bug, and due to the lack of activity in this issue - I’ve forked @robcza fork, which implements both the configurable retry, and the exit on failure, and created a public image in docker hub: https://hub.docker.com/r/imriz/logspout/