faas-netes: Gateway Pod crashes when Profiles CRD is deleted
My actions before raising this issue
- Followed the troubleshooting guide
- Read/searched the docs
- Searched past issues
@alexellis Posting here and planning to edit and follow up, so that this issue doesn’t get lost.
Expected Behaviour
I don’t think that this should crash the gateway, but I do think an error should be logged.
Current Behaviour
The gateway crashes if you try to create a function with a profile; this initially appears to be related to setting the operator and CRD flags to true on the faas-netes/openfaas Helm chart.
Are you a GitHub Sponsor (Yes/No?)
Check at: https://github.com/sponsors/openfaas
- Yes
- No
- No, but I sponsor Alex
List All Possible Solutions and Workarounds
Which Solution Do You Recommend?
Steps to Reproduce (for bugs)
Context
This is crashing our gateway instances rather than simply failing to create the functions.
Your Environment
- FaaS-CLI version (full output from `faas-cli version`): 0.13.13
- Docker version (`docker version`, e.g. Docker 17.0.05): 20.10.8
- Which deployment method do you use?
  - OpenFaaS on Kubernetes
  - faasd
- Kubernetes version: server v1.21.3, client v1.22.2
- Operating System and version (e.g. Linux, Windows, MacOS): MacOS
- Code example or link to GitHub repo or gist to reproduce problem:
- Other diagnostic information / logs from troubleshooting guide
Next steps
You may join Slack for community support.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 21 (20 by maintainers)
I plan on doing a full repro and updating this issue with whatever I identify as the suspected problem; I do not intend to let this go stale, I am just juggling things over here.
@alexellis I cannot reproduce the issue as described:
I can reproduce the original bug that causes the gateway pod to crash when the Profile CRD is missing.
Regarding the bug I can reproduce: I can ensure that the gateway starts by checking for the Profile CRD during startup. However, we now need to discuss the error edge cases.
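For reference, a minimal sketch of what a startup check for the Profile CRD could look like, using the apiextensions client. The CRD name `profiles.openfaas.com` and the surrounding wiring are assumptions for illustration, not the actual faas-netes code:

```go
package main

import (
	"context"
	"log"

	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
)

// profilesCRDInstalled reports whether the Profiles CRD is registered in the cluster.
func profilesCRDInstalled(ctx context.Context, cfg *rest.Config) (bool, error) {
	client, err := apiextensionsclient.NewForConfig(cfg)
	if err != nil {
		return false, err
	}

	_, err = client.ApiextensionsV1().CustomResourceDefinitions().
		Get(ctx, "profiles.openfaas.com", metav1.GetOptions{})
	if errors.IsNotFound(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("cluster config: %v", err)
	}

	installed, err := profilesCRDInstalled(context.Background(), cfg)
	if err != nil {
		log.Fatalf("checking Profiles CRD: %v", err)
	}
	if !installed {
		// Log loudly and run with profiles disabled instead of crashing the gateway.
		log.Println("Profiles CRD not found; profile support is disabled")
	}
}
```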
When using the controller (i.e. classic faas-netes), I can also add errors to the API when it sees Profiles being used in a cluster that has not enabled them. This seems fine, but we still have error cases that can occur when the CRD is deleted after startup. If we want to be very safe, I could check for the CRD on every function deploy/update request, but those deploys will fail with an error anyway, so I am not sure the extra check is really needed. We should probably change some of the logic so that the profile client is only used when the current deployment or the current request references Profiles.
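As a sketch of that last idea, only touching the profile client when a request actually references Profiles could look like the following. The annotation key, helper names, and error text are assumptions for illustration, not the exact faas-netes implementation:

```go
package profiles

import "fmt"

// profileAnnotation is the annotation functions use to request Profiles
// (assumed key, shown for illustration).
const profileAnnotation = "com.openfaas.profile"

// profilesReferenced reports whether the function's annotations ask for any Profiles.
func profilesReferenced(annotations map[string]string) bool {
	value, ok := annotations[profileAnnotation]
	return ok && len(value) > 0
}

// applyProfiles only consults the profile client when the request actually
// references Profiles, so a missing CRD cannot affect unrelated deployments.
func applyProfiles(annotations map[string]string, crdInstalled bool) error {
	if !profilesReferenced(annotations) {
		return nil
	}
	if !crdInstalled {
		// Return an API error to the caller instead of letting the gateway crash.
		return fmt.Errorf("function sets %q but the Profiles CRD is not installed", profileAnnotation)
	}
	// ...look up and apply the referenced Profiles here...
	return nil
}
```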
The operator has the same problem, but with a twist: we don’t currently have any kind of validation webhook/controller, so we need to handle Function objects that reference Profiles even though the cluster doesn’t support them.
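A rough sketch of how the operator's sync path could reject such Function objects without a webhook; the types and function names here are placeholders, not the real operator code:

```go
package operator

import "fmt"

// Function is a stand-in for the operator's Function CRD object.
type Function struct {
	Name        string
	Annotations map[string]string
}

// validateFunction fails the reconcile with an ordinary error (which can be
// retried and surfaced as an event) when a Function references Profiles in a
// cluster where the Profiles CRD is missing, instead of panicking.
func validateFunction(fn Function, crdInstalled bool) error {
	if _, ok := fn.Annotations["com.openfaas.profile"]; ok && !crdInstalled {
		return fmt.Errorf("function %s references a Profile but the Profiles CRD is not installed", fn.Name)
	}
	return nil
}
```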
I see four options
The way that this works is that the `GetProfiles` method would return the `DisabledPodProfile` if the profiles feature is disabled (because we can’t find the CRD). Conversely, `GetProfilesToRemove` would include the `DisabledPodProfile` when the profiles feature is enabled. This means you have the following possibilities:
- profiles disabled: `GetProfiles` returns `DisabledPodProfile`; `GetProfilesToRemove` returns `nil` or the empty list
- profiles enabled: `GetProfiles` returns the list of profiles (as it would behave today), which may be empty; `GetProfilesToRemove` returns the list of profiles (or an empty value) and we always append the `DisabledPodProfile`

This combination of behaviors ensures that we disable profile-dependent functions when the CRD is missing and re-enable those functions once the CRD exists.
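A minimal sketch of that contract, with placeholder types; the `Profile` struct, the `Client` wiring, and the sentinel name are assumptions for illustration:

```go
package profiles

// Profile is a stand-in for the OpenFaaS Profile type.
type Profile struct {
	Name string
	// ...tolerations, runtimeClassName, etc.
}

// DisabledPodProfile is the sentinel applied when the Profiles CRD is missing.
var DisabledPodProfile = Profile{Name: "openfaas-profiles-disabled"}

// Client knows whether the profiles feature is usable in this cluster.
type Client struct {
	enabled bool // false when the Profiles CRD cannot be found
}

// GetProfiles returns the sentinel when profiles are disabled; otherwise it
// returns the profiles referenced by the function, which may be empty.
func (c *Client) GetProfiles(referenced []Profile) []Profile {
	if !c.enabled {
		return []Profile{DisabledPodProfile}
	}
	return referenced
}

// GetProfilesToRemove is the inverse: when profiles are enabled we always
// append the sentinel so a previously "disabled" function gets re-enabled on
// the next deploy/update.
func (c *Client) GetProfilesToRemove(toRemove []Profile) []Profile {
	if !c.enabled {
		return nil
	}
	return append(toRemove, DisabledPodProfile)
}
```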
Option (a) looks like this
Option (b) looks like this
I can then check whether the profiles client is configured and include this profile in the Add/Remove checks. The benefit of using a Toleration is that a cluster admin could decide (for some reason) to ignore the profiles completely and allow functions to be scheduled. Additionally, they can also just fix the cluster by deploying the Profile CRD and restarting the controller/operator; it will then remove this profile the next time the functions are updated/redeployed.
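To make the Toleration idea concrete, the sentinel profile could carry a single marker toleration along these lines. The taint key is purely illustrative, and a toleration by itself does not change scheduling unless the cluster admin chooses to taint nodes:

```go
package profiles

import corev1 "k8s.io/api/core/v1"

// disabledToleration is the marker carried by the DisabledPodProfile
// (hypothetical key, shown for illustration only).
var disabledToleration = corev1.Toleration{
	Key:      "profiles.openfaas.com/disabled",
	Operator: corev1.TolerationOpExists,
	Effect:   corev1.TaintEffectNoSchedule,
}
```

The Add/Remove checks described above would then simply append or strip this toleration from the Pod template whenever the sentinel profile appears in the respective list.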
I see this warning, and it looks like the pod restarted but did not crash.
Using k8s 1.22.1 running on a Kind cluster.
@alexellis taking a look now