microk8s: snap auto-refresh breaks cluster
This morning a close-to-production cluster fell over after snap’s auto-refresh “feature” failed on 3 of 4 worker nodes - it looks like it hung at the Copy snap "microk8s" data step. microk8s could be restarted after aborting the auto-refresh, but only after manually killing snapd… For a production-ready Kubernetes distribution I think this is far from an acceptable default… Perhaps, until snapd allows disabling auto-refreshes, the microk8s scripts could recommend running sudo snap set system refresh.hold=2050-01-01T15:04:05Z or similar. A Kubernetes-native integration with snapd refreshes could also be considered (e.g. a Prometheus/Grafana dashboard/alert) to prompt manual updates - presumably one node at a time to begin with.
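For reference, a minimal sketch of that hold-based workaround, using standard snapd options (the far-future date is purely illustrative, and older snapd releases reportedly cap how long the hold is honoured - see the 60-day remark later in this thread):
sudo snap set system refresh.hold=2050-01-01T15:04:05Z   # postpone auto-refresh until the given date
snap refresh --time                                      # verify the schedule and hold that snapd has recorded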
Otherwise microk8s is working rather well so thank you very much.
More details about the outage:
kubectl get nodes
NAME           STATUS     ROLES    AGE   VERSION
10.aa.aa.aaa   Ready      <none>   38d   v1.17.3
10.aa.aa.aaa   NotReady   <none>   18d   v1.17.2
10.aa.aa.aaa   NotReady   <none>   38d   v1.17.2
10.aa.aa.aaa   NotReady   <none>   18d   v1.17.2
aaa-master     Ready      <none>   59d   v1.17.3
microk8s is disabled…
root@wk3:/home# snap list
Name       Version     Rev    Tracking   Publisher    Notes
core       16-2.43.3   8689   stable     canonical✓   core
kubectl    1.17.3      1424   1.17       canonical✓   classic
microk8s   v1.17.2     1176   1.17       canonical✓   disabled,classic
root@wk3:/home# snap changes microk8s
ID Status Spawn Ready Summary
20 Doing today at 09:56 AEDT - Auto-refresh snap "microk8s"
The data copy appears hung:
root@wk3:/home# snap tasks --last=auto-refresh
Status Spawn Ready Summary
Done today at 09:56 AEDT today at 09:56 AEDT Ensure prerequisites for "microk8s" are available
Done today at 09:56 AEDT today at 09:56 AEDT Download snap "microk8s" (1254) from channel "1.17/stable"
Done today at 09:56 AEDT today at 09:56 AEDT Fetch and check assertions for snap "microk8s" (1254)
Done today at 09:56 AEDT today at 09:56 AEDT Mount snap "microk8s" (1254)
Done today at 09:56 AEDT today at 09:56 AEDT Run pre-refresh hook of "microk8s" snap if present
Done today at 09:56 AEDT today at 09:57 AEDT Stop snap "microk8s" services
Done today at 09:56 AEDT today at 09:57 AEDT Remove aliases for snap "microk8s"
Done today at 09:56 AEDT today at 09:57 AEDT Make current revision for snap "microk8s" unavailable
Doing today at 09:56 AEDT - Copy snap "microk8s" data
Do today at 09:56 AEDT - Setup snap "microk8s" (1254) security profiles
Do today at 09:56 AEDT - Make snap "microk8s" (1254) available to the system
Do today at 09:56 AEDT - Automatically connect eligible plugs and slots of snap "microk8s"
Do today at 09:56 AEDT - Set automatic aliases for snap "microk8s"
Do today at 09:56 AEDT - Setup snap "microk8s" aliases
Do today at 09:56 AEDT - Run post-refresh hook of "microk8s" snap if present
Do today at 09:56 AEDT - Start snap "microk8s" (1254) services
Do today at 09:56 AEDT - Clean up "microk8s" (1254) install
Do today at 09:56 AEDT - Run configure hook of "microk8s" snap if present
Do today at 09:56 AEDT - Run health check of "microk8s" snap
Doing today at 09:56 AEDT - Consider re-refresh of "microk8s"
There doesn’t seem to be much to copy anyway:
root@wk3 /v/l/snapd# du -sh /var/lib/snapd/ /var/snap/ /snap
527M /var/lib/snapd/
74G /var/snap/
2.0G /snap
root@wk3 /s/microk8s# du -sh /snap/microk8s/*
737M /snap/microk8s/1176
737M /snap/microk8s/1254
root@wk3 /s/microk8s# du -sh /var/snap/microk8s/*
232K /var/snap/microk8s/1176
74G /var/snap/microk8s/common
Starting microk8s fails
user@wk3 /s/m/1254> sudo snap start microk8s
error: snap "microk8s" has "auto-refresh" change in progress
root@wk3:/home# snap enable microk8s
error: snap "microk8s" has "auto-refresh" change in progress
Fails to abort…
root@wk3:/home# snap abort 20
root@wk3:/home# snap changes
ID Status Spawn Ready Summary
20 Abort today at 09:56 AEDT - Auto-refresh snap "microk8s"
user@wk3 /s/m/1254> sudo snap start microk8s
error: snap "microk8s" has "auto-refresh" change in progress
root@wk3:/home# snap enable microk8s
error: snap "microk8s" has "auto-refresh" change in progress
snapd service hangs when trying to stop it…
root@wk2 ~# systemctl stop snapd.service
(hangs)
I have to resort to manually killing the process:
killall snapd
Finally the change is undone…
root@wk3:/home# snap changes
ID Status Spawn Ready Summary
20 Undone today at 09:56 AEDT today at 10:41 AEDT Auto-refresh snap "microk8s"
root@wk3:/home# snap tasks --last=auto-refresh
Status Spawn Ready Summary
Done today at 09:56 AEDT today at 10:41 AEDT Ensure prerequisites for "microk8s" are available
Undone today at 09:56 AEDT today at 10:41 AEDT Download snap "microk8s" (1254) from channel "1.17/stable"
Done today at 09:56 AEDT today at 10:41 AEDT Fetch and check assertions for snap "microk8s" (1254)
Undone today at 09:56 AEDT today at 10:41 AEDT Mount snap "microk8s" (1254)
Undone today at 09:56 AEDT today at 10:41 AEDT Run pre-refresh hook of "microk8s" snap if present
Undone today at 09:56 AEDT today at 10:41 AEDT Stop snap "microk8s" services
Undone today at 09:56 AEDT today at 10:41 AEDT Remove aliases for snap "microk8s"
Undone today at 09:56 AEDT today at 10:41 AEDT Make current revision for snap "microk8s" unavailable
Undone today at 09:56 AEDT today at 10:41 AEDT Copy snap "microk8s" data
Hold today at 09:56 AEDT today at 10:30 AEDT Setup snap "microk8s" (1254) security profiles
Hold today at 09:56 AEDT today at 10:30 AEDT Make snap "microk8s" (1254) available to the system
Hold today at 09:56 AEDT today at 10:30 AEDT Automatically connect eligible plugs and slots of snap "microk8s"
Hold today at 09:56 AEDT today at 10:30 AEDT Set automatic aliases for snap "microk8s"
Hold today at 09:56 AEDT today at 10:30 AEDT Setup snap "microk8s" aliases
Hold today at 09:56 AEDT today at 10:30 AEDT Run post-refresh hook of "microk8s" snap if present
Hold today at 09:56 AEDT today at 10:30 AEDT Start snap "microk8s" (1254) services
Hold today at 09:56 AEDT today at 10:30 AEDT Clean up "microk8s" (1254) install
Hold today at 09:56 AEDT today at 10:30 AEDT Run configure hook of "microk8s" snap if present
Hold today at 09:56 AEDT today at 10:30 AEDT Run health check of "microk8s" snap
Hold today at 09:56 AEDT today at 10:30 AEDT Consider re-refresh of "microk8s"
root@wk3:/home# snap list
Name       Version     Rev    Tracking   Publisher    Notes
core       16-2.43.3   8689   stable     canonical✓   core
kubectl    1.17.3      1424   1.17       canonical✓   classic
microk8s   v1.17.2     1176   1.17       canonical✓   classic
Nothing much in snapd logs except for a polkit error - unsure if related:
root@wk3:/home# journalctl -b -u snapd.service
...
Mar 09 06:11:34 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 09 16:11:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl", "microk8s"
Mar 09 16:11:31 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 09 19:06:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl", "microk8s"
Mar 09 19:06:31 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 10 02:51:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl", "microk8s"
Mar 10 02:51:31 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 10 09:56:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl"
Mar 10 10:12:18 wk3 snapd[15182]: daemon.go:208: polkit error: Authorization requires interaction
Mar 10 10:39:24 wk3 systemd[1]: Stopping Snappy daemon...
Mar 10 10:39:24 wk3 snapd[15182]: main.go:155: Exiting on terminated signal.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: State 'stop-sigterm' timed out. Killing.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Killing process 15182 (snapd) with signal SIGKILL.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Main process exited, code=killed, status=9/KILL
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Failed with result 'timeout'.
Mar 10 10:40:54 wk3 systemd[1]: Stopped Snappy daemon.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Triggering OnFailure= dependencies.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Found left-over process 16729 (sync) in control group while starting unit. Ignoring.
Mar 10 10:40:54 wk3 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 10 10:40:54 wk3 systemd[1]: Starting Snappy daemon...
Mar 10 10:40:54 wk3 snapd[18170]: AppArmor status: apparmor is enabled and all features are available
Mar 10 10:40:54 wk3 snapd[18170]: AppArmor status: apparmor is enabled and all features are available
Mar 10 10:40:54 wk3 snapd[18170]: daemon.go:346: started snapd/2.43.3 (series 16; classic) ubuntu/18.04 (amd64) linux/4.15.0-88-generic.
Mar 10 10:40:54 wk3 snapd[18170]: daemon.go:439: adjusting startup timeout by 45s (pessimistic estimate of 30s plus 5s per snap)
Mar 10 10:40:54 wk3 systemd[1]: Started Snappy daemon.
Today I experienced a crash of a PRODUCTION 3-node “HA” microk8s cluster. It just auto-updated to 1.21.5! As a programmer and admin, I cannot comprehend what the people deciding how to package crucial services have in mind when they choose such a broken-by-design tool as snap. Why does Ubuntu use it at all, when it is hardly suitable even for desktop apps and not suitable for services at all? What if some medical outfit bought into the “highly available” advertising and people died because it auto-updates? Snap should be dropped for anything other than desktop apps - or better, dropped altogether in favour of the proven-over-years .deb packaging…
Your point was that security is paramount and absolute, and that this should be the excuse that makes this problem okay. It’s not; it’s an excuse that only exacerbates this problem, and the problems with snap on servers in general.
Snaps are fine for user apps; those can deal with being restarted, crashing, and shutting down, again and again. Server apps need more delicacy, planning, and oversight. No admin/operator wants to give a developer control over when, how, and why something will update; they want complete control over their systems, and snap’s auto-updating feature is a complete insult to that.
I’m glad you agree, then? I’d rather have a cluster which is outdated and vulnerable, and possibly get hacked, if it is down to my own oversight and my own fault (at least then I can tune it to my own schedule and my own system). With auto-update, and even with the update window, that control is taken away from me: now I have to scramble to make sure the eventual update will not fuck with my system, and then do it manually, safely and in a controlled way to make sure it does not fuck over the data (which it did for me: 1.2 TB of scraping data, all corrupted, because docker didn’t want to close within 30 seconds, after which it got SIGKILLed).
As a sysadmin, I control a developer’s software: when, where, and how. The developer doesn’t control my system unless I tell it to - and even then, only on my own terms.
Snaps violate this principle, and that’s why I’m incredibly displeased with them.
@ktsakalozos the point of “security” is pretty moot if updating breaks everything; it defeats its own purpose.
The Kubernetes project ships a few releases every month [1]. These releases include security, bug and regression fixes. Every production-grade Kubernetes distribution should have a mechanism to ship such fixes, sometimes even before they are released upstream. For MicroK8s, this mechanism is snaps. Snaps allow us to keep your Kubernetes infrastructure up to date, not only with fresh Kubernetes binaries but also with updates and fixes to the integrations with the underlying system and the Kubernetes ecosystem.
If you do not want to take the risk of automated refreshes you have at least two options:
[1] https://github.com/kubernetes/kubernetes/releases
[2] https://docs.ubuntu.com/snap-store-proxy/en/
[3] https://snapcraft.io/docs/keeping-snaps-up-to-date
For anyone reading this with the same issue: snapd currently doesn’t allow auto-updates to be disabled indefinitely, other than via this suggestion from a forum thread that hosts a bigger umbrella discussion of the topic: https://forum.snapcraft.io/t/disabling-automatic-refresh-for-snap-from-store/707/268
TL;DR: snap download foo ; snap install foo.snap --dangerous, replacing foo with the application in question.
Personally, I find it telling that snapd doesn’t offer more comprehensive and cooperative options. Something like (web)hooks on a new snap version, letting an external system handle refreshes one by one (for example: drain a node first, refresh it, run some canary checks on it, then do the same for the rest one by one, reverting everything on the first error, plus an automatic email describing the failed upgrade), would be of great help and would still be within the ethos of keeping snaps up to date - but the system has to trust sysadmins enough to make that possible.
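A hedged illustration of that workaround applied to microk8s (the channel and flags are assumptions based on the snap list output earlier in the thread; a --dangerous install is not tracked against the store, so it will never auto-refresh and all future upgrades become manual):
snap download microk8s --channel=1.17/stable                 # fetches microk8s_<rev>.snap plus its .assert file
sudo snap install ./microk8s_*.snap --dangerous --classic    # install the local file; --classic matches microk8s' confinement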
A note that is probably useful for people who have commented on this issue: starting with snapd 2.58, it is possible to hold MicroK8s (and any other installed snap package) back from updating indefinitely with the following command:
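Presumably this refers to snapd 2.58’s per-snap hold, along these lines (--hold and --unhold are standard snapd 2.58+ flags):
sudo snap refresh --hold microk8s     # hold refreshes for this snap indefinitely
sudo snap refresh --unhold microk8s   # resume refreshes later, when you choose to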
See also the “Control updates” section in the snapd documentation
Not stale
Yes, I understand that… so if it is a policy of Canonical and the snap team not to change this (and I really think not letting the user disable it is the wrong idea), why waste time on this issue at all? The snap team is not listening to users. If you don’t use a workaround like the ones discussed in this issue - or worse, if you follow the official MicroK8s Kubernetes distribution docs and install it on a production cluster - you will break everything when auto-refresh changes something. From my sysadmin point of view, that is faaaaaaaaaar from a production-grade system.
@evildmp Do you really think you gave a satisfying, professional response to my statement? You missed a chance to clarify, with something like:
‘As a director of engineering at Canonical I’d like to assure you that you can expect professional and complete documentation for our products. This also includes controversial fixes, alternatives, or internal usage instructions for our products from our staff or our users, even if we don’t recommend those to our customers. We encourage and enforce transparency. We are committed to leaving the choice to our users and customers to use and deploy our products in a way that best suits their needs, even if we don’t agree with it or consider it harmful. Of course we will flag those with a big fat warning label.’
As long as you won’t acknowledge that the best answer to this issue still comes from your former employee [see: Bypassing store refresh], and as long as interns don’t have the courage, or the company’s permission, to share genuinely helpful background information, I’m afraid it isn’t cynical to say that hopefully more people will leave the company and finally speak their minds. That would be a really sad conclusion for the friends of Canonical, Ubuntu and MicroK8s.
…but given that snap is what it is, and microk8s is what it is, I’ll try to bring this back to something constructive.
After some experimentation, it looks like the basic problem is that microk8s.stop brings down the k8s infrastructure (including the hostpath provisioner) before the containers have had a chance to react to the SIGTERM and finish gracefully. Would it be possible to change either microk8s.stop or the snap refresh hook so that the containers are allowed to terminate before the infrastructure disappears?
This is my test script, which I piped to a hostpath-provisioned folder.
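A minimal sketch of the kind of test script being described (print a counter every second and report when SIGTERM arrives; an assumed reconstruction, run in a pod with its output redirected to the hostpath-provisioned folder):
#!/bin/sh
# Hypothetical reconstruction: count once per second and log when SIGTERM
# arrives, to observe how much grace time the container actually gets.
trap 'echo "Got SIGTERM, exiting gracefully"; exit 0' TERM
echo "Waiting for SIGTERM"
i=0
while [ "$i" -lt 60 ]; do
  i=$((i+1))
  echo "$i"
  sleep 1
done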
When bringing the container down via kubectl, it counts to 30 seconds. When doing microk8s.stop, I see nothing after “Waiting for SIGTERM”, even though I can see the process live for 30 seconds after that (in ps).
@ktsakalozos I’m sure you do understand that a simultaneous update of the package on any number of nodes WILL lead to a service stop, because ALL the nodes restart the service at the same time? With Kubernetes you (normally) do not just restart a node: you drain it, make sure the other nodes pick up the pods, and then upgrade/maintain it. So you just CANNOT distribute patches the way you do. Yes, you just cannot. It may sound a little strong, but imagine when lives depend on the stuff you run, or mass transport, or whatever else affects many people. You DO call microk8s SUPER reliable, so ANY upgrade, even a minor one, must be done by the admin - and the admin may then choose to implement auto-upgrades, because the admin will do it right: draining the node first, and so on.
I suspect the people behind snap are even scoffing at us, since it seems they won’t allow delaying auto-refresh by more than 60 days.
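For reference, the node-by-node pattern being described is the standard kubectl drain/uncordon cycle; a hedged sketch (the node name is a placeholder, and older kubectl releases spell the second flag --delete-local-data):
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data   # move workloads off the node first
# ...refresh/upgrade microk8s on that node and confirm it is healthy...
kubectl uncordon <node-name>                                           # allow workloads back onto the node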
False - point me to exactly where in your comment you said that, and keep in mind I don’t consider the Snap Store Proxy to be a solution to this.
Great, thanks snap auto-refresh, you have crashed my entire cluster with this:
/snap/microk8s/2338/bin/dqlite: symbol lookup error: /snap/microk8s/2338/bin/dqlite: undefined symbol: sqlite3_system_errno
This isn’t stale, this is still a problem.
@evildmp, it has nothing to do with open source etc. I do not criticize the code, but the packaging choice itself, which breaks the WHOLE idea of having microk8s, and I am using plain words to say so. Once again I will try to explain what is wrong - it’s probably useless, but anyway. Look at the meaning of the messages from the other users in this thread: everyone is saying the same thing in different words. Everyone says “it’s broken, it’s unacceptable, change it, do something”. If there is no other way to distribute microk8s across platforms, then just state in the docs, up front, very visibly and understandably: “DO NOT USE THIS IN PRODUCTION WHEN INSTALLED VIA SNAP, BECAUSE YOUR SERVICE MAY BREAK RANDOMLY AND UNEXPECTEDLY”. Then nobody will have a problem. We will simply look for another Kubernetes from the start, use microk8s (probably not, really) for dev only, and not experience random, unexpected production failures from something whose front page calls it rock-solid reliable. Stating somewhere deep in the docs “install manually if you do not like snap” changes nothing; it just leaves users even more disappointed.
The approach of the microk8s team is not professional. The sole purpose of Kubernetes is to provide a platform for running fault-tolerant services, and that is completely ruined by packaging it with a totally unsuitable, broken-by-design tool like snap. It’s hard to say more. I have personally removed microk8s from all my servers and migrated to k3s. Next is to replace Ubuntu with plain Debian.
@evildmp if I were to guess, a large share of the frustration is not personal, but rather aimed at snap’s packaging in and of itself, which is the root cause of this problem.
The solution offered is a hack, an explicit circumvention of the problem. It does not offer a satisfying resolution, nor does it lighten the burden the problem caused; it only cripples the effectiveness of the platform. Meanwhile a better solution is available on snap’s side, but they do not wish to give developers those tools, for political and ideological reasons that explicitly take control away from users in a paternalistic fashion - the developers would like to think they know their users’ systems better than the users do. (Which, imo, may be true for ordinary application users, is far less true for developers, and is very much not true for system administrators - yet snap treats them all with the same attitude.)
I don’t want to perpetuate the cycle here. At the very least, know this: it wasn’t personal; the frustration is high, and this issue is just one part of the knot where the pressure became too high.
@a-hahn Hi. With respect to the documentation: no matter how many warnings or comments are added, sadly many people don’t read them. There is no hidden agenda; this is simply a matter of responsibility. If you buy a car and want to disconnect all the warning lights, that is up to you, but you wouldn’t expect to find the instructions for doing so printed in the owner’s manual. Please add your method as a comment on Discourse if you like.
@ktsakalozos @evilnick Well, I understand that it’s “hard to recommend”, but I think it is well worth mentioning because it is an obvious and simple solution for many users. Trying to hide that piece of information just hurts your/Canonical’s reputation as a first-class solution provider. This is even more true as snaps often stack on top of each other (consider microk8s installed in LXC containers, both installed via snaps), making troubleshooting much more difficult than it needs to be. Moreover, I can’t see anything positive about auto-updates: they are a backdoor into the installation, and who knows which hidden features or paywalls might be introduced in the future. Misleading people never pays in the long run. I think you should think twice about not accepting the docs update. I feel alarmed by now.
Interesting read from a former Canonical employee who worked in the snap advocacy team on how to disable auto-updates efficiently: https://popey.com/blog/2021/05/disabling-snap-autorefresh/
TL;DR: snap download microk8s && snap install microk8s --dangerous prevents any auto-refreshing.
Yup. I guess there are really two issues at play here:
1. snap auto-refresh restarting microk8s at a time the operator does not control.
2. The shutdown ordering of microk8s.stop. Uncontroversial (probably); fixing it would make #1 less harmful and also help with e.g. non-snap-triggered restarts and reboots.
Should #2 get a separate issue?
Based on what I’ve read here and elsewhere, I think that snap-the-philosophy is fundamentally unsound for high availability or production server deployments. When it comes to snap-the-technology, I find it darkly amusing that it has found a way of restarting k8s that corrupts my postgresql db. Which it now does reliably 4 times a day.
@ktsakalozos Also, if you want to reproduce: a 3-node HA cluster on 1.21/stable with rook-ceph installed, OSDs on each node, and some pods that use volumes from those OSDs - refresh one of the nodes and that should do the trick.
@Dart-Alex I would recommend moving away from snaps if this is your experience, as I expect nobody from Canonical will give this issue genuine consideration.
I had to restart all 3 nodes to bring it up again.
For the rest of what you mentioned - you probably understand that everything you suggested is a hack, and totally unsuitable for a distribution? Nobody is suggesting you stop distributing patches, but do you understand that NO ONE in their right mind auto-updates critical services like databases, Kubernetes clusters and similar? You deploy a highly resilient, highly available, highly redundant system like k8s, which on its own ATTEMPTS to be resistant to failures and tries to keep itself alive… and then you effectively “kill -9” it from the outside? What is the point of developing microk8s at all if you ruin its redundancy and resiliency simply by choosing the wrong distribution tool?
@ktsakalozos It’s one of the first things we tried but it did not help. Not sure about now, it’s a staging server and after hours now.
@vazir Absolutely - it has now become a priority for us to migrate away from microk8s. It’s a bit irresponsible that the docs/GitHub page doesn’t carry a big warning saying “microk8s should not be used in any situation where multiple people depend on it”, given that this is obviously a complete deal-breaker for any kind of serious usage. Going by this thread, this has been an issue from day 1 and is never going to change given snap’s unwillingness to make auto-refresh easy to disable. The only way any kind of auto-update system is reasonable for serious applications is if you’re Microsoft: in complete control of every line of update code you ship, with gigantic resources to test every possible setup and scenario - and even then they still let you disable it on systems meant for serious use cases.
You are given this option. You can block, schedule or postpone refreshes. You are even given the option of an “offline” manual deployment that never updates. Although this may suit your needs, from the perspective of a Kubernetes distribution it is very hard to call this good practice.
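For completeness, a hedged sketch of the standard snapd controls being referred to (all values are illustrative):
sudo snap set system refresh.timer=sat,03:00-05:00       # schedule: only refresh during a weekly maintenance window
sudo snap set system refresh.hold=2030-01-01T00:00:00Z   # postpone: push refreshes out to a given date
snap refresh --time                                      # inspect what snapd will actually do, and when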