prometheus-operator: prometheus fails to start with permission denied errors
What happened?
After being fine for a couple of days, Prometheus mysteriously started crashing with this error:
level=error ts=2020-09-02T15:16:55.096Z caller=query_logger.go:87 component=activeQueryTracker msg="Error opening query log file" file=/prometheus/queries.active err="open /prometheus/queries.active: permission denied"
panic: Unable to create mmap-ed active query log
goroutine 1 [running]:
github.com/prometheus/prometheus/promql.NewActiveQueryTracker(0x7ffd4fab7a45, 0xb, 0x14, 0x30898a0, 0xc000786180, 0x30898a0)
/app/promql/query_logger.go:117 +0x4cd
main.main()
/app/cmd/prometheus/main.go:374 +0x4f08
Did you expect to see something different?
Yes, I expected Prometheus not to crash.
How to reproduce it (as minimally and precisely as possible):
I deployed this resource:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: tilt-prometheus
  namespace: tilt-telemetry
spec:
  serviceAccountName: prometheus
  resources:
    requests:
      memory: 200Mi
  serviceMonitorSelector:
    matchLabels:
      prometheus: tilt-prometheus
  enableAdminAPI: false
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard
        resources:
          requests:
            storage: 100G
Environment
GKE 1.17
- Prometheus Operator version:
https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.41.1/bundle.yaml
- Kubernetes version information:
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-26T20:32:49Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.9-gke.1703", GitCommit:"df645059d1c711bd1d15feff39c6deac341a7c4b", GitTreeState:"clean", BuildDate:"2020-08-18T16:07:52Z", GoVersion:"go1.13.9b4", Compiler:"gc", Platform:"linux/amd64"}
- Kubernetes cluster kind:
GKE
- Manifests:
insert manifests relevant to the issue
- Prometheus Operator Logs:
N/A
Anything else we need to know?:
I was able to work around this problem by adding an initContainer. Here’s what my YAML looks like now:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: tilt-prometheus
  namespace: tilt-telemetry
spec:
  serviceAccountName: prometheus
  resources:
    requests:
      memory: 200Mi
  serviceMonitorSelector:
    matchLabels:
      prometheus: tilt-prometheus
  enableAdminAPI: false
  initContainers:
  - name: prometheus-data-permission-fix
    image: busybox
    command: ["/bin/chmod", "-R", "777", "/prometheus"]
    volumeMounts:
    - mountPath: /prometheus
      name: prometheus-tilt-prometheus-db
      subPath: prometheus-db
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard
        resources:
          requests:
            storage: 100G
Obviously, this is not ideal, because it hard-codes details that the operator normally manages for the storage (the volume name, subPath, and mount path).
IMHO, a reasonable first step towards fixing this would be to add a feature to the operator such that, if you specify storage, the operator injects an init container that does some sanity checking on the storage (permissions, etc.).
I’ve seen other Helm charts do this (e.g., the Astronomer Airflow chart), and it seems to be emerging as a common pattern for operators that deal with storage, since storage on Kubernetes has so many footguns.
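For illustration, a check-only init container along those lines might look something like this (a hypothetical sketch, not an existing operator feature; the container name, the UID 1000, and the volume/subPath names are assumptions mirroring the workaround above):
# Hypothetical check-only init container: fail fast with a clear message
# if the data directory is not writable by the Prometheus UID, instead of
# running chmod -R 777 on the whole volume.
initContainers:
- name: prometheus-data-permission-check
  image: busybox
  securityContext:
    runAsUser: 1000   # assumed to match the UID the Prometheus container runs as
  command:
  - sh
  - -c
  - 'touch /prometheus/.writable && rm /prometheus/.writable || { echo "/prometheus is not writable by UID $(id -u)"; exit 1; }'
  volumeMounts:
  - mountPath: /prometheus
    name: prometheus-tilt-prometheus-db
    subPath: prometheus-db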
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 16 (6 by maintainers)
I bumped into a similar problem just now, also with dynamically provisioned storage on GKE, and got it fixed by adding a security context, e.g.:
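Something along these lines on the Prometheus spec (a sketch; the commenter's exact values weren't included, and the UID/GID numbers below are assumptions borrowed from the kube-prometheus defaults):
securityContext:
  # Assumed values (kube-prometheus defaults); adjust to the UID your Prometheus image runs as.
  runAsUser: 1000
  runAsNonRoot: true
  fsGroup: 2000
The fsGroup setting is what makes the kubelet chown the mounted volume's group so the Prometheus process can write to it.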
I’m guessing it has to do with the Prometheus UID and the provisioned disk? For what it’s worth:
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.11-gke.5", GitCommit:"baccd25d44f1a0d06ad3190eb508784efbb990a5", GitTreeState:"clean", BuildDate:"2020-06-25T22:55:26Z", GoVersion:"go1.13.9b4", Compiler:"gc", Platform:"linux/amd64"}
Adding securityContext to the Prometheus object should fix the problem.

I think I’ve grokked this problem in my specific case, and I’m confident this issue should be closed, as it is neither an issue with Prometheus nor with prometheus-operator.
If you are using CSI (you probably are) with a recent release of the CSI controller sidecar, the default ext4 fsType has been removed. Previously (in Kubernetes v1.12) there was another code change that stopped applying fsGroup (chown) to a volume if it does not specify an fsType. Obviously, it’s impossible to account for all existing Kubernetes distributions and their ways of deploying CSI, but simply adding --default-fstype=ext4 to external-provisioner, or explicitly specifying fsType on StorageClasses, should fix this problem for new volumes. You’ll have to recreate a PersistentVolume if it does not specify fsType.
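For example, on GKE with the PD CSI driver, an explicit fsType on the StorageClass would look roughly like this (an illustrative sketch; the class name is made up, and csi.storage.k8s.io/fstype is the generic CSI StorageClass parameter for the filesystem type):
# Illustrative StorageClass that pins fsType explicitly so fsGroup (chown)
# is applied to newly provisioned volumes again.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-ext4
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-standard
  csi.storage.k8s.io/fstype: ext4
volumeBindingMode: WaitForFirstConsumer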