seneca-mesh: problems with services losing connection to mesh

Apologies, this is a bit long, but it’s the culmination of quite a lot of pain, struggling, and debugging with a pretty fatal problem.

We’ve got a bit of a problem. I’ve been trying to debug it myself but I’m basically at wit’s end trying to figure out what the cause is. My suspicion is that it relates to the number of pins on the mesh and a timing issue: swim starts sending ping requests before a service is actually ready to receive them, and those pings fail.

When we started seeing this, the symptom was that services in the mesh would lose connections. Specifically, we have an auth service that wraps seneca-user, and on the front end an auth pattern is hit on almost every page visit. We would see the front-end service unable to find the auth pattern and fail with act_not_found. We’re still in development, so the resolution has been to kill everything and start it up again.

The hope was that this was an isolated thing we would only see during development, but now we’re looking to get this running outside of local developers’ machines, using docker and rancher to spin up all the microservices, and we’re seeing it there too: when upgrading nodes to the latest version (basically stop, pull the latest image, start) and when scaling to multiple instances.

I discovered I could see the problem clearly if I pass the following option to the base service:

seneca.use(SenecaMesh, {
  // etc...
  balance_client: {debug: {client_updates: true}}
})

With this present, we can see when nodes are added and when they are removed. When the symptoms above appear, things go, for lack of a better word, haywire. The log floods with several messages per second, both adds and removes, showing pins from all microservices in the mesh.

It seems to spread like a virus, starting with one node and moving to others. Once it gets into this state, the only way I’ve found to resolve it is to kill everything and start it up again. When this happens locally running everything (10 microservices), the system slows to a crawl… in rancher, nodes go unhealthy, are removed and re-added, and it’s basically a perpetual reboot loop.

Our sample configuration for a service looks something like the following:

const seneca = Seneca({
  tag: '...',
  transport: {host: IP_ADDRESS}
})

seneca.use(SenecaMesh, {
  host: IP_ADDRESS,
  bases: [`${MESH_HOST_BASE}:${MESH_HOST_PORT}`],
  listen: ['array of pin strings'].map(pin => ({pin, host: IP_ADDRESS}))
})

The liberal use of IP_ADDRESS is to get it working with docker/rancher. The base node is the only one addressed by host name because (for now) there is only one of them. Networking is handled by rancher here, and we can reach related services by host name; in practice this falls down when there are multiple instances of a service, since rancher appends numbers to the names… so we just use IP addresses in the config.

I’ve omitted the pins, but in practice each service has 4 model:observe pins for cache clears and service startup notifications, and exposes roughly 7-9 pins in addition to those. The front-end service has an extra model:observe for route:set from seneca-web. Each service also has a health check set up on a pre-defined port, which rancher uses to perform its health checks.
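
For concreteness, here is a rough sketch of what one service’s registration ends up looking like. The pin strings, port, and health check handler below are hypothetical placeholders rather than our real ones:

// Hypothetical pin layout for one service (names are placeholders).
const OBSERVE_PINS = [
  'model:observe,cache:clear,entity:user',
  'model:observe,cache:clear,entity:session',
  'model:observe,cache:clear,entity:profile',
  'model:observe,service:startup'
]

const SERVICE_PINS = [
  'role:auth,cmd:login',
  'role:auth,cmd:logout',
  'role:auth,cmd:register'
  // ...the remaining 4-6 pins this service exposes
]

seneca.use(SenecaMesh, {
  host: IP_ADDRESS,
  bases: [`${MESH_HOST_BASE}:${MESH_HOST_PORT}`],
  listen: OBSERVE_PINS.concat(SERVICE_PINS).map(pin => ({pin, host: IP_ADDRESS}))
})

// The health check is a plain HTTP endpoint on a fixed port that rancher polls.
require('http')
  .createServer((req, res) => res.end('ok'))
  .listen(HEALTHCHECK_PORT)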

Other notes and things I’ve noticed in debugging this (fruitlessly):

  • if a service calls into another while it is booting up, things are far more likely to break. We got around this locally with liberal use of timeout in the fuge config, and in rancher with a startup script that waits for the health checks of dependent services to be available before attempting to start (see the sketch after this list).

  • the model:observe pins strangely seem to cause the problem more often. After we added service number 10, we weren’t able to start the project at all without hitting this. On the front-end service, after it received the route payload from seneca-web (3 services emit their own routes, including number 10), we’d see a remove/add for the front-end’s own pins. The cache model:observe pins seem to be what push it out of control here… commenting those out, we can at least get it started… it still removes/adds route:set and the health check pins, but it doesn’t spin out of control.

  • looking through the call stack at breakpoints, it seems that nodes are marked as faulty by swim, and that marking makes its way to sneeze and finally to mesh. Why this happens is still unknown to me and may be the underlying cause of the problem: the services should be alive and kicking by the time swim starts polling them, right?

  • we are running with seneca.fixedargs.fatal$ = false on all services. This is so we can pass back errors in seneca actions without killing the services. I tried removing this to see if things would die properly but this was not the case. There are no errors raised throughout this, just a lot of network traffic and degraded nodes.
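
The startup guard mentioned in the first bullet looks roughly like the sketch below. This is a simplified, hypothetical version: the dependency list, ports, and /healthcheck path are placeholders, and the real script is rancher-specific.

// Poll each dependency's health check until it answers, then start the service.
const http = require('http')

const DEPENDENCIES = [
  {host: 'auth-service', port: 8401},
  {host: 'user-service', port: 8402}
]

function waitForHealthy({host, port}, retries = 60) {
  return new Promise((resolve, reject) => {
    const attempt = remaining => {
      http.get({host, port, path: '/healthcheck'}, res => {
        res.resume()
        res.statusCode === 200 ? resolve() : retry(remaining)
      }).on('error', () => retry(remaining))
    }
    const retry = remaining =>
      remaining > 0
        ? setTimeout(attempt, 1000, remaining - 1)
        : reject(new Error(`${host}:${port} never became healthy`))
    attempt(retries)
  })
}

Promise.all(DEPENDENCIES.map(dep => waitForHealthy(dep)))
  .then(() => require('./service')) // only start seneca/mesh once dependencies answer
  .catch(err => { console.error(err.message); process.exit(1) })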

I have a feeling this can be worked around by providing options to swim by way of sneeze options, but for the life of me, I don’t know what the best options are, nor do I know if this is just a stopgap until we add more pins/services and need to increase timeouts again.
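
For reference, the shape I’ve been experimenting with is below. Take it with a grain of salt: the sneeze key and the swim timing names come from reading the seneca-mesh/sneeze/swim sources rather than any documentation, so they may differ between versions, and the numbers are only a starting point, not a recommendation.

seneca.use(SenecaMesh, {
  host: IP_ADDRESS,
  bases: [`${MESH_HOST_BASE}:${MESH_HOST_PORT}`],
  // Assumption: mesh forwards this block to sneeze, which hands the swim
  // timings to the swim module.
  sneeze: {
    swim: {
      interval: 500,        // how often members are pinged
      joinTimeout: 2000,    // how long a node waits when joining the mesh
      pingTimeout: 1000,    // timeout for a direct ping
      pingReqTimeout: 1500  // timeout for indirect pings via other members
    }
  }
})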

About this issue

  • State: open
  • Created 7 years ago
  • Comments: 19 (6 by maintainers)

Most upvoted comments

I’ll have to take a look at this monitoring option - that looks incredibly useful, thanks @rjrodger

In terms of debugging, I would say that when a service loses a node in the mesh, it should log a warning. Without balance_client: {debug: {client_updates: true}} in the options, there is no way to see that something is wrong unless one is looking at top and seeing the CPU pinned, or at actions returning with act_not_found. client_updates can be a bit chatty and I really only have it on the base node… it would be nice to see specifically “mesh pin xxx has been marked faulty” on the node that marked it faulty, without passing additional options.

Also, docs and/or best practices would be incredibly helpful… maybe a flashing, blinking marquee on the readme that says “if you use this in a non-trivial project, you will need to configure swim for your network”… As it stands, there’s no information about sneeze/swim opts… I had to dig into the code to figure out that I could even pass options along to sneeze/swim, then dig further to find out what the options were… and even further into the swim paper to find out what it all meant. Maybe an “advanced swim for dummies” page on the wiki or something.