alertmanager: Large nflogs/silences can never be gossiped

One issue we need to deal with:

The internal queue of messages to gossip is currently unbounded, and only messages under a certain size can be sent: currently 1400 bytes (configurable) minus the per-message overhead (memberlist framing, usually a few bytes). Any message that would push the total gossip size past this “gossip size limit” stays in the queue and is retried on the next round … but a message that is larger than the limit on its own can never be sent, because it will never fit.

Looking at our setup, we have some alerts that can greatly exceed this max size for a gossip message:

2018-06-12_12:02:03.97148 [DEBUG] memberlist: message not sent because over limit size_total=156746 overhead=3 size_used=0 size_msg=156743 limit=1398 position=1/2

These are just some internal log lines I added, but the important part is that overhead + size_used + size_msg needs to be LESS than limit. Because size_msg is way larger than limit, this will sit in the queue forever.
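To make the failure mode concrete, here is a minimal sketch of the size check described above. It is a simplified model, not memberlist's actual code: the function name `drain_once` and the queue handling are assumptions for illustration, and only the numbers (limit=1398, overhead=3, size_msg=156743) come from the log line.

```python
from collections import deque


def drain_once(queue, limit, overhead):
    """Model one gossip round: send what fits, keep everything else queued.

    Hypothetical helper; a message is sent only if
    size_used + overhead + size_msg stays below limit.
    """
    sent, kept, size_used = [], deque(), 0
    for msg in queue:
        if size_used + overhead + len(msg) < limit:
            sent.append(msg)
            size_used += overhead + len(msg)
        else:
            kept.append(msg)  # retried next round
    return sent, kept


limit, overhead = 1398, 3  # values taken from the log output
queue = deque([b"x" * 156743, b"y" * 200])

for round_no in range(3):
    sent, queue = drain_once(queue, limit, overhead)
    print(round_no, [len(m) for m in sent], [len(m) for m in queue])
```

In this model the 200-byte message goes out in the first round, while the 156743-byte message is still queued after every round, no matter how many rounds run.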

The messages slowly pile up:

2018-06-12_12:16:20.77173 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=12/13
2018-06-12_12:16:20.77174 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=11/13
2018-06-12_12:16:20.77175 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=10/13
2018-06-12_12:16:20.77176 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=9/13
2018-06-12_12:16:20.77177 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=8/13
2018-06-12_12:16:20.77178 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=7/13
2018-06-12_12:16:20.77179 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=6/13
2018-06-12_12:16:20.77185 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=5/13
2018-06-12_12:16:20.77186 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=4/13
2018-06-12_12:16:20.77189 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=3/13
2018-06-12_12:16:20.77190 [DEBUG] memberlist: message not sent because over limit size_total=156746 overhead=3 size_used=0 size_msg=156743 limit=1398 position=2/13
2018-06-12_12:16:20.77191 [DEBUG] memberlist: message not sent because over limit size_total=156746 overhead=3 size_used=0 size_msg=156743 limit=1398 position=1/13

Now the hard part … how do we solve this? Do we not gossip messages that are too large, and raise the limit? Do we hack up any messages that exceed our byte limit and send them piecemeal? Do we attempt to make a hash of hashes for “large” messages, and fall back to just doing a direct comparison? What do we set the limit to?
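As a rough illustration of the “send them piecemeal” option, the sketch below splits an oversized payload into chunks that each fit under the limit and reassembles them on the other side. Everything here is hypothetical: the `(index, total, data)` chunk layout, the 8-byte chunk header, and the function names are assumptions, not anything Alertmanager or memberlist provides, and the sketch ignores lost or duplicated chunks.

```python
import math


def split_message(payload, limit, overhead, header=8):
    """Split payload so that overhead + header + len(chunk) < limit.

    header is an assumed per-chunk cost for carrying (index, total).
    """
    max_chunk = limit - overhead - header - 1
    if max_chunk <= 0:
        raise ValueError("limit too small to carry any payload")
    total = max(1, math.ceil(len(payload) / max_chunk))
    return [
        (i, total, payload[i * max_chunk:(i + 1) * max_chunk])
        for i in range(total)
    ]


def reassemble(chunks):
    """Rebuild the payload once every chunk has arrived, in any order."""
    ordered = sorted(chunks, key=lambda c: c[0])
    if not ordered or len(ordered) != ordered[0][1]:
        raise ValueError("missing chunks")
    return b"".join(data for _, _, data in ordered)


payload = bytes(i % 251 for i in range(156743))  # same size as in the logs
chunks = split_message(payload, limit=1398, overhead=3)
print(len(chunks), "chunks")
```

With these assumed numbers, one 156743-byte message becomes over a hundred gossip messages, which is part of why the choice of limit and the delivery guarantees matter for this approach.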

@brancz @fabxc @simonpasquier @mxinden @brian-brazil

Cross posted from #1340

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 26 (26 by maintainers)

Most upvoted comments

Sorry I wasn’t clear enough. I was commenting on @simonpasquier’s suggestion to “keep the existing gossip processing for smaller messages”. I think we’re better off fully understanding and controlling one way of gossiping that we can improve when we can actually measure it, and my suggestion is to implement the memberlist.Transport with TCP for this.