faas-netes: Support request for 404 errors whilst scaling the gateway
Expected Behaviour
When a new gateway pod is created in an HA environment, we expect the endpoint lister to find all of the function deployments without problems.
Current Behaviour
In an HA gateway deployment with 4 gateways, we sometimes see 1 of the 4 gateways consistently return 404s (Not Found) for some fraction of the calls in the faas-netes container. This is most likely to happen right after the new gateway pod is created. About 99% of our errors come from this line and 1% from this line.
We run with a setup where function pods are terminated within 10 seconds of a call completing, unless other connections from a gateway are opened.
A typical scenario looks as follows: a new gateway is created and we see that some fraction of calls start returning 404s.
In the screenshot above, new gateways were spawned right after 9:40 am, which is when the 404s start to show up.
We have also set gateway.directFunctions=false and run all calls asynchronously, through the NATS queue and the queue-workers. So if my interpretation of the call path is correct, it should be: Gateway -> faas-netes -> NATS queue -> queue-worker -> Function pod (correct?). We are seeing the 404s in the faas-netes container, so I suspect the NATS queue is innocent.
Possible Solution
We’re not sure how to solve this. I suspect something is corrupting the cache of the k8s client, perhaps because function pods are going in and out of existence so quickly. Then, when we ask for the entry in the cache here, the lookup won't be successful. But in that case I would expect all gateways to be equally affected, not just one.
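For context, this is roughly how endpoint resolution looks when a client-go lister is backed by an informer cache: if the cache has not yet been populated (or the function's Endpoints object has already been deleted), the Get fails and the caller would surface that as a 404. This is only a minimal sketch of the pattern, not the actual faas-netes code, and the namespace and function name are illustrative.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// A shared informer factory keeps a local cache of Endpoints objects.
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	endpointsLister := factory.Core().V1().Endpoints().Lister()

	stopCh := make(chan struct{})
	factory.Start(stopCh)

	// If this lookup runs before the cache has synced, or after the
	// function's Endpoints object has been deleted, Get returns a
	// not-found error that a provider would turn into an HTTP 404.
	ep, err := endpointsLister.Endpoints("openfaas-fn").Get("figlet") // namespace and name are illustrative
	if err != nil {
		fmt.Println("lookup failed, would be returned as 404:", err)
		return
	}
	fmt.Println("resolved endpoints object:", ep.Name)
}
```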
Steps to Reproduce (for bugs)
Unfortunately, this bug is difficult to reproduce locally due to the complex call pattern that we have. One might be able to reproduce it by creating a scenario where function pods are invoked at intervals of less than one second, as sketched below.
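One way to approximate that pattern is a small load generator that fires async invocations at sub-second intervals while functions are scaled down aggressively. This is a rough sketch only; the gateway address and function name are placeholders, not part of our setup.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"time"
)

func main() {
	// Placeholder values: point these at your own gateway and function.
	gateway := "http://127.0.0.1:8080"
	function := "figlet"

	client := &http.Client{Timeout: 5 * time.Second}
	for {
		// Async invocations go through NATS and the queue-worker before the
		// function pod is reached via faas-netes.
		resp, err := client.Post(gateway+"/async-function/"+function,
			"text/plain", strings.NewReader("hello"))
		if err != nil {
			fmt.Println("request error:", err)
		} else {
			fmt.Println("status:", resp.StatusCode) // 202 is expected for async
			resp.Body.Close()
		}
		time.Sleep(500 * time.Millisecond) // sub-second call interval
	}
}
```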
Context
This issue is vital for us to solve due to the unexpected nature of the 404s.
Your Environment
- FaaS-CLI version (full output from faas-cli version): NA
- Docker version: containerd://1.3.2g
- Kubernetes version: v1.17.14-gke.400
- Operating System and version (e.g. Linux, Windows, MacOS): GKE Linux, Container-Optimized OS from Google
- Link to your project or a code example to reproduce issue: NA
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 37 (31 by maintainers)
Commits related to this issue
- Sync endpoints before starting HTTP server The HTTP server which is used for CRUD and invocations should not be started until the cached informers for endpoints is ready. This may be related to issue... — committed to openfaas/faas-netes by alexellis 3 years ago
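For reference, the pattern described in that commit, waiting for the endpoints informer cache to sync before serving traffic, looks roughly like this with client-go. This is a sketch of the general technique under that description, not the actual faas-netes change; the port and setup are illustrative.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	endpointsInformer := factory.Core().V1().Endpoints().Informer()

	stopCh := make(chan struct{})
	factory.Start(stopCh)

	// Block until the endpoints cache is populated, so the first invocations
	// after a restart do not hit an empty cache and come back as 404s.
	if !cache.WaitForCacheSync(stopCh, endpointsInformer.HasSynced) {
		log.Fatal("timed out waiting for endpoints cache to sync")
	}

	// Only start serving CRUD and invocation requests once the cache is ready.
	log.Fatal(http.ListenAndServe(":8081", http.NewServeMux()))
}
```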
Thank you for the extra information.
This sounds like a very different issue. Cognite are pushing very low volumes of data and use a custom fork. You are trying to find the breaking point in openfaas on GKE.
This looks like something we would need to debug with you on a high-touch engagement. Let's follow up over email?
I would suggest a separate issue, but I feel this is very specific to each company rather than a generic situation. If we can find an issue by working with either company, then a generic issue would make sense, along with whatever else we need to get a resolution.
In the interim, if you have anything else we could use to look into the problem, feel free to share it here.