postgres-operator: securityContext.fsGroup setting leads to permission issues on restart / remount
Overview
Our Postgres instances regularly stop working, with the following in the logs:
2022-02-08 13:36:54.275 UTC [1257059] FATAL: data directory "/pgdata/pg13" has invalid permissions
2022-02-08 13:36:54.275 UTC [1257059] DETAIL: Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).
Indeed, after restarts / remounts, the pg13 directory has permissions rwxrwx--- again, even though we had removed the group write bit manually.
We suspect the issue is that securityContext.fsGroup is being set at https://github.com/CrunchyData/postgres-operator/blob/2e18aef93dd2d6dee065ad00c959dc9fabc6da79/internal/postgres/reconcile.go#L280
According to https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#security-context, this causes the CSI driver to assign group 26 to the volume and to set its permissions to the current permissions OR'd with rw-rw----, which always adds the unwanted g+w.
We tried setting openshift: true, which removes fsGroup from the descriptor, but this causes the postgres startup container to fail, as the secrets can no longer be read:
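To make the arithmetic concrete, here is a minimal Go sketch of the behaviour described above. It is only a model: `applyFSGroup` and `acceptedByPostgres` are hypothetical helper names, the OR with rw-rw---- follows the Kubernetes documentation linked above, and the Postgres check is simplified to the two modes named in the DETAIL log line.

```go
package main

import "fmt"

// fsGroupMask is the rw-rw---- pattern that, per the Kubernetes pod
// security context documentation, is OR'd into the permission bits.
const fsGroupMask uint32 = 0o660

// applyFSGroup models the documented effect of fsGroup on a mode:
// the existing permission bits OR'd with rw-rw----.
// (Hypothetical helper; not code from the kubelet or a CSI driver.)
func applyFSGroup(mode uint32) uint32 {
	return mode | fsGroupMask
}

// acceptedByPostgres is a simplified stand-in for the server's check:
// it accepts only the two modes named in the DETAIL log line.
func acceptedByPostgres(mode uint32) bool {
	return mode == 0o700 || mode == 0o750
}

func main() {
	for _, before := range []uint32{0o700, 0o750} {
		after := applyFSGroup(before)
		fmt.Printf("%04o -> %04o accepted=%t\n", before, after, acceptedByPostgres(after))
	}
}
```

Under these assumptions, both modes that Postgres accepts gain the group write bit once the mask is applied, and 0750 becomes 0770, which is the rwxrwx--- we see after a remount.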
Initializing ...
::postgres-operator: uid::26
::postgres-operator: gid::26
::postgres-operator: postgres path::/usr/pgsql-13/bin/postgres
::postgres-operator: postgres version::postgres (PostgreSQL) 13.5
::postgres-operator: config directory::/pgdata/pg13
::postgres-operator: data directory::/pgdata/pg13
install: cannot open '/pgconf/tls/replication/tls.crt' for reading: Permission denied
install: cannot open '/pgconf/tls/replication/tls.key' for reading: Permission denied
install: cannot open '/pgconf/tls/replication/ca.crt' for reading: Permission denied
Environment
- Platform: k3s v1.22.5+k3s1
- PGO: 5.0.4
- Postgres Version: crunchy-postgres:centos8-13.5-0
- Storage: Hetzner CSI Driver v1.6.0 (https://github.com/hetznercloud/csi-driver)
About this issue
- State: closed
- Created 2 years ago
- Reactions: 8
- Comments: 26 (6 by maintainers)
Hello, just wanted to let y’all know, in case you haven’t seen the attached PR, that the fsGroupChangePolicy fix that was suggested and tested in this thread has been merged into PGO and should be out in the next release. Reminder (for future people who find this) that this fix is only effective in K8s 1.20+. Hope this helps, and if you’re still experiencing a problem like this, please let the community know.
I have been using the change mentioned by @cbandy for about 6 weeks now and haven’t had any problems since.
@matzik12 We did not change anything manually in our clusters, but used a policy agent to automatically patch the pods created by the controller manager. Here is an example for Kyverno, but since OPA Gatekeeper now supports mutations, that should work as well.
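For readers who want a rough idea of what such a pod mutation could look like, below is a minimal Go sketch that builds an RFC 6902 JSON Patch of the kind a mutating admission hook might return. This is not the Kyverno policy referred to above. It assumes the patch sets the pod-level `fsGroupChangePolicy` field to `OnRootMismatch` (the thread names the field but not the value), and that `spec.securityContext` already exists on the pod, which holds whenever `fsGroup` is set. The `patchOp` type and `fsGroupChangePolicyPatch` function are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is a single RFC 6902 JSON Patch operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// fsGroupChangePolicyPatch builds a patch that sets the pod-level
// fsGroupChangePolicy. Assumption: OnRootMismatch is the intended value,
// and spec.securityContext is already present on the pod.
func fsGroupChangePolicyPatch() ([]byte, error) {
	ops := []patchOp{{
		Op:    "add",
		Path:  "/spec/securityContext/fsGroupChangePolicy",
		Value: "OnRootMismatch",
	}}
	return json.Marshal(ops)
}

func main() {
	patch, err := fsGroupChangePolicyPatch()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
}
```

A policy agent would apply the equivalent change declaratively; the sketch only shows which field ends up being modified on the pod.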