faas-netes: Gateway Pod crashes when Profiles CRD is deleted

My actions before raising this issue

@alexellis Posting here now and planning to edit and follow up, so that this issue doesn’t get lost.

Expected Behaviour

I don’t think that this should crash the gateway, but I do think an error should be logged.

Current Behaviour

The gateway crashes if you try to create a function with a profile; this initially seemed related to setting the operator and CRD flags to true on the faas-netes/openfaas Helm chart.

Are you a GitHub Sponsor (Yes/No?)

Check at: https://github.com/sponsors/openfaas

  • Yes
  • No
  • No, but I sponsor Alex

List All Possible Solutions and Workarounds

Which Solution Do You Recommend?

Steps to Reproduce (for bugs)

Context

This crashes our gateway instances rather than simply failing to create the functions.

  • FaaS-CLI version ( Full output from: faas-cli version ): 0.13.13

  • Docker version docker version (e.g. Docker 17.0.05 ): 20.10.8

  • Which deployment method do you use?:

  • OpenFaaS on Kubernetes
  • faasd

Kubernetes version: server v1.21.3, client v1.22.2

  • Operating System and version (e.g. Linux, Windows, MacOS): MacOS

  • Code example or link to GitHub repo or gist to reproduce problem:

  • Other diagnostic information / logs from troubleshooting guide

Next steps

You may join Slack for community support.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 21 (20 by maintainers)

Most upvoted comments

I plan on doing a full repro and updating this issue with whatever I identify to be the suspected problem; I do not intend to let this go stale, just juggling over here.

@alexellis I cannot reproduce the issue as described (screenshot attached).

I can reproduce the original bug that causes the gateway pod to crash when the Profile CRD is missing.


Regarding the bug I can reproduce: I can ensure that the Gateway starts by checking for the Profile CRD during startup. However, we now need to discuss the error edge cases.

When using the controller (i.e. classic faas-netes), I can also add errors to the API when it sees Profiles being used in a cluster that has not enabled them. This seems fine, but we still have error cases that can occur when the CRD is deleted after startup. If we want to be very safe, I could check for the CRD on every function deploy/update request, but those deploys will fail with an error anyway, so I am not sure the extra check is really needed. We should probably change some of the logic so that the profile client is only used when the current deployment or the current request references Profiles.
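A minimal sketch of the startup check described above. In a real controller the resource list would come from client-go's discovery API; here the helper name and its string-slice input are assumptions so the example stays self-contained:

```go
package main

import (
	"fmt"
	"strings"
)

// hasProfileCRD reports whether the cluster's API resource list contains the
// Profile resource. A real implementation would query the Kubernetes
// discovery API for the openfaas.com group; here the list is passed in as
// plain strings for illustration.
func hasProfileCRD(apiResources []string) bool {
	for _, r := range apiResources {
		if strings.EqualFold(r, "profiles.openfaas.com") {
			return true
		}
	}
	return false
}

func main() {
	// Simulated discovery results for two clusters.
	withCRD := []string{"functions.openfaas.com", "profiles.openfaas.com"}
	withoutCRD := []string{"functions.openfaas.com"}

	fmt.Println(hasProfileCRD(withCRD))    // true
	fmt.Println(hasProfileCRD(withoutCRD)) // false
}
```

If the check fails at startup, the gateway can log a clear error (or disable the profiles feature) instead of crashing later when the watch stream breaks.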

The operator has the same problem, but with a twist: since we don’t currently have any kind of validation webhook/controller, we need to handle Function objects that reference profiles even though the cluster doesn’t support them.

I see four options:

  1. allow all functions to be deployed, completely ignoring Profiles, and simply log a warning when we see functions requesting profiles. I am not a fan of this: it could lead to bugs that are very hard for people to debug. I am only including the option for completeness.
  2. create a validation webhook endpoint in the controller so that we can return validation messages for CRD users as well; we can then mimic the same behavior as the classic faas-netes.
  3. make the Profile CRD a hard requirement: the Operator should crash with a clear message telling the admin to deploy both the Function and Profile CRDs, and we can add this to the Helm chart install flow as well. But this approach doesn’t really solve the problem where someone removes the CRD after the operator starts; we would then have error cases that are not surfaced to the developer/client.
  4. If Profiles is disabled, modify the Function deployment so that the function is not schedulable. One interesting way to do this is to use a Profile that we know will cause the function to be broken. We have two options: (a) apply a bad RuntimeClass or (b) add a toleration for an unlikely-to-exist taint.
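Option 2's check could be sketched roughly as below; `validateProfiles` and its error are hypothetical names, and the `com.openfaas.profile` annotation is the mechanism functions use to request a profile:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrProfilesDisabled is a hypothetical validation error returned when a
// function requests a profile but the Profile CRD is not installed.
var ErrProfilesDisabled = errors.New("profiles are not enabled on this cluster: install the Profile CRD and restart the gateway")

// validateProfiles models option 2: reject a deployment that references
// profiles (via the com.openfaas.profile annotation) when the feature is off.
func validateProfiles(profilesEnabled bool, annotations map[string]string) error {
	if _, ok := annotations["com.openfaas.profile"]; ok && !profilesEnabled {
		return ErrProfilesDisabled
	}
	return nil
}

func main() {
	err := validateProfiles(false, map[string]string{"com.openfaas.profile": "gpu"})
	fmt.Println(err != nil) // true: rejected when profiles are disabled

	err = validateProfiles(true, map[string]string{"com.openfaas.profile": "gpu"})
	fmt.Println(err == nil) // true: allowed when profiles are enabled
}
```

The same function could back both the REST API validation in classic faas-netes and a future admission webhook for the operator.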

The way that this works is that the GetProfiles method would return the DisabledPodProfile if the profiles feature is disabled (because we can’t find the CRD). Conversely, GetProfilesToRemove would include the DisabledPodProfile when the profiles feature is enabled. This means that you have the following possibilities:

  1. profiles disabled: GetProfiles returns the DisabledPodProfile
  2. profiles disabled: GetProfilesToRemove returns nil or the empty list
  3. profiles enabled: GetProfiles returns the list of profiles (as it behaves today), which may be empty
  4. profiles enabled: GetProfilesToRemove returns the list of profiles (or an empty value) and we always append the DisabledPodProfile

This combination of behaviors ensures that we disable profile-dependent functions when the CRD is missing and re-enable those functions once the CRD exists again.
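The four cases above can be modeled with two pure functions. This is a simplified sketch that uses profile names as strings rather than the real Profile type, so all identifiers here are assumptions:

```go
package main

import "fmt"

// disabledPodProfile is a sentinel profile name standing in for the
// DisabledPodProfile discussed above.
const disabledPodProfile = "disabled-pod-profile"

// getProfiles models the proposed behavior: when the profiles feature is
// disabled (the CRD is missing), every lookup returns only the sentinel;
// otherwise the function's own profiles are returned.
func getProfiles(enabled bool, requested []string) []string {
	if !enabled {
		return []string{disabledPodProfile}
	}
	return requested
}

// getProfilesToRemove models the inverse: when profiles are enabled we always
// append the sentinel so it is stripped from previously-disabled functions.
func getProfilesToRemove(enabled bool, stale []string) []string {
	if !enabled {
		return nil
	}
	return append(stale, disabledPodProfile)
}

func main() {
	fmt.Println(getProfiles(false, []string{"gpu"}))        // [disabled-pod-profile]
	fmt.Println(getProfilesToRemove(true, []string{"old"})) // [old disabled-pod-profile]
}
```

Because the sentinel is always in exactly one of the two lists, a reconcile pass converges to the right state regardless of whether the CRD was added or removed since the last deploy.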

Option (a) looks like this:

// DisabledPodProfile is used when the profiles feature is disabled.
//
// The cluster admin can fix this function by applying the Profile CRD, restarting the gateway pod,
// and then redeploying the affected Functions.
var DisabledPodProfile = Profile{
	RuntimeClassName: "of-profiles-disabled",
}

Option (b) looks like this:

// DisabledPodProfile is used when the profiles feature is disabled. The toleration `openfaas-profiles=disabled`
// will be added to the Function so that it is not schedulable by default.
//
// The cluster admin can fix this function by applying the Profile CRD, restarting the gateway pod,
// and then redeploying the affected Functions.
//
// Alternatively, the cluster admin can override this by adding the taint
//     `openfaas-profiles=disabled:NoSchedule`
// to the cluster nodes.
var DisabledPodProfile = Profile{
	Tolerations: []corev1.Toleration{
		{
			Key:      "openfaas-profiles",
			Value:    "disabled",
			Operator: corev1.TolerationOpEqual,
		},
	},
}

I can then check whether the profiles client is configured and include this profile in the Add/Remove checks. The benefit of using a Toleration is that a cluster admin could decide (for some reason) to ignore profiles completely and allow functions to be scheduled. Additionally, they can also fix the cluster by deploying the Profile CRD and restarting the controller/operator; it will then remove this profile the next time the functions are updated/redeployed.

I see this warning and it looks like the pod restarted, but did not crash:

✦ $ k logs -n openfaas gateway-f4dc6c956-w8hbk faas-netes -f
2021/10/30 13:03:24 Version: 0.13.8     commit: c0a8c3cba4156fac9847953a83cea03bf54e42ef
W1030 13:03:24.564036       1 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2021/10/30 13:03:24 HTTP Read Timeout: 1m0s
2021/10/30 13:03:24 HTTP Write Timeout: 1m0s
2021/10/30 13:03:24 ImagePullPolicy: Always
2021/10/30 13:03:24 DefaultFunctionNamespace: openfaas-fn
2021/10/30 13:03:24 Starting controller
I1030 13:03:24.587189       1 shared_informer.go:223] Waiting for caches to sync for faas-netes:deployments
I1030 13:03:24.699939       1 shared_informer.go:230] Caches are synced for faas-netes:deployments
I1030 13:03:24.700054       1 shared_informer.go:223] Waiting for caches to sync for faas-netes:endpoints
I1030 13:03:24.800314       1 shared_informer.go:230] Caches are synced for faas-netes:endpoints
I1030 13:03:24.800468       1 shared_informer.go:223] Waiting for caches to sync for faas-netes:profiles
I1030 13:03:24.901292       1 shared_informer.go:230] Caches are synced for faas-netes:profiles
W1030 13:09:27.636346       1 reflector.go:404] github.com/openfaas/faas-netes/main.go:193: watch of *v1.Profile ended with: an error on the server ("unable to decode an event from the watch stream: unable to decode watch event: no kind \"Profile\" is registered for version \"openfaas.com/v1\" in scheme \"github.com/openfaas/faas-netes/pkg/client/clientset/versioned/scheme/register.go:20\"") has prevented the request from succeeding

Using k8s 1.22.1 running on a Kind cluster.

@alexellis taking a look now