rancher: rabbitmq-ha breaks with "MY_POD_NAME: not found" after scaling to 0 and back
What kind of request is this (question/bug/enhancement/feature request):
bug
Steps to reproduce (least amount of steps as possible):
Install the rabbitmq-ha helm chart available from catalogs. Scale to 0 and then scale back up to 3.
Result:
The pod will begin to crash loop with the following error in its log:
/opt/rabbitmq/sbin/rabbitmq-server: eval: line 1: MY_POD_NAME: not found
Other details that may be helpful:
A customer has the same issue with rabbitmq but the error is:
/usr/lib/rabbitmq/bin/rabbitmq-server: 1: eval: MY_POD_IP: not found
The deployment manifest they used is at the bottom of the issue.
Environment information
- Rancher version (rancher/rancher/rancher/server image tag or shown bottom left in the UI): I have been able to recreate on 2.1.3 and 2.1.6
- Installation option (single install/HA): single
Cluster information
- Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Custom
- Machine type (cloud/VM/metal) and specifications (CPU/memory): VM
- Kubernetes version (use kubectl version): 1.11.5, 1.11.6 and I even tried it on 1.12.4 with the same result
- Docker version (use docker version): 17.03.2
Deployment Manifest
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rabbitmq-bootstrap
  namespace: dev-shared
data:
  bootstrap.sh: |-
    #!/bin/bash
    sleep 10
    spinoff() {
      if [[ "$(hostname)" == "rabbitmq-0" ]]; then
        until rabbitmqctl cluster_status; do
          echo "waiting for rabbit to start"
          sleep 1
        done
        sleep 30
        echo "Running on node 0. Starting bootstrap."
        rabbitmqctl add_vhost opspub
        rabbitmqctl add_vhost mcs
        rabbitmqctl add_vhost offergating
        rabbitmqctl add_vhost odo
        rabbitmqctl add_user opspubUser %opspubUser%
        rabbitmqctl add_user mcsUser %mcsUser%
        rabbitmqctl add_user oguser %oguser%
        rabbitmqctl add_user odouser %odouser%
        rabbitmqctl set_permissions -p opspub opspubUser ".*" ".*" ".*"
        rabbitmqctl set_permissions -p mcs mcsUser ".*" ".*" ".*"
        rabbitmqctl set_permissions -p offergating oguser ".*" ".*" ".*"
        rabbitmqctl set_permissions -p odo odouser ".*" ".*" ".*"
        rabbitmqctl set_policy ha-all ".*" \
          '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic"}' -p opspub
        rabbitmqctl set_policy ha-all ".*" \
          '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic"}' -p mcs
        rabbitmqctl set_policy ha-all ".*" \
          '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic"}' -p offergating
        rabbitmqctl set_policy ha-all ".*" \
          '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic"}' -p odo
        rabbitmqctl add_user development password
        rabbitmqctl set_user_tags development management
        rabbitmqctl set_permissions -p opspub development ".*" ".*" ".*"
        rabbitmqctl set_permissions -p mcs development ".*" ".*" ".*"
        rabbitmqctl set_permissions -p offergating development ".*" ".*" ".*"
        rabbitmqctl set_permissions -p odo development ".*" ".*" ".*"
      fi
    }
    spinoff &
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rabbitmq
  namespace: dev-shared
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: endpoint-reader
  namespace: dev-shared
rules:
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["get"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: endpoint-reader
  namespace: dev-shared
subjects:
  - kind: ServiceAccount
    name: rabbitmq
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: endpoint-reader
---
kind: Service
apiVersion: v1
metadata:
  namespace: dev-shared
  name: rabbitmq
  labels:
    app: rabbitmq
    type: LoadBalancer
spec:
  type: NodePort
  ports:
    - name: http
      protocol: TCP
      port: 15676
      targetPort: 15672
    - name: amqp
      protocol: TCP
      port: 5676
      targetPort: 5672
  selector:
    app: rabbitmq
---
kind: Service
apiVersion: v1
metadata:
  name: rabbitmq-data
  namespace: dev-shared
spec:
  selector:
    app: rabbitmq
  ports:
    - name: rabbitmq-data
      protocol: TCP
      port: 5676
      targetPort: 5672
  externalIPs:
    - 10.219.3.2
    - 10.219.3.3
    - 10.219.3.4
    - 10.219.3.5
    - 10.219.3.6
    - 10.219.2.255
    - 10.219.3.0
---
kind: Service
apiVersion: v1
metadata:
  name: rabbitmq-ui
  namespace: dev-shared
spec:
  selector:
    app: rabbitmq
  ports:
    - name: rabbitmq-data
      protocol: TCP
      port: 15676
      targetPort: 15672
  externalIPs:
    - 10.219.3.2
    - 10.219.3.3
    - 10.219.3.4
    - 10.219.3.5
    - 10.219.3.6
    - 10.219.2.255
    - 10.219.3.0
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rabbitmq-config
  namespace: dev-shared
data:
  enabled_plugins: |
    [rabbitmq_management,rabbitmq_peer_discovery_k8s].
  rabbitmq.conf: |
    cluster_formation.randomized_startup_delay_range.min = 0
    cluster_formation.randomized_startup_delay_range.max = 2
    cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
    cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
    cluster_formation.k8s.address_type = ip
    cluster_formation.node_cleanup.interval = 5
    cluster_formation.node_cleanup.only_log_warning = false
    cluster_partition_handling = autoheal
    queue_master_locator = random
    loopback_users.guest = false
    vm_memory_high_watermark.absolute = 1024MiB
    default_user = admin
    default_pass = %admin%
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: rabbitmq
  namespace: dev-shared
spec:
  serviceName: rabbitmq
  replicas: 3
  template:
    metadata:
      labels:
        app: rabbitmq
    spec:
      serviceAccountName: rabbitmq
      terminationGracePeriodSeconds: 10
      containers:
        - name: rabbitmq-k8s
          resources:
            limits:
              cpu: 500m
              memory: 1Gi
            requests:
              cpu: 500m
              memory: 500Mi
          image: rabbitmq:3.7
          volumeMounts:
            - name: config-volume
              mountPath: /etc/rabbitmq
            - name: bootstrap-vol
              mountPath: /tmp/bootstrap.sh
              subPath: bootstrap.sh
              readOnly: true
          ports:
            - name: http
              protocol: TCP
              containerPort: 15676
            - name: amqp
              protocol: TCP
              containerPort: 5676
          livenessProbe:
            exec:
              command: ["rabbitmqctl", "status"]
            initialDelaySeconds: 60
            periodSeconds: 60
            timeoutSeconds: 10
          readinessProbe:
            exec:
              command: ["rabbitmqctl", "status"]
            initialDelaySeconds: 20
            periodSeconds: 60
            timeoutSeconds: 10
          imagePullPolicy: IfNotPresent
          env:
            - name: MY_POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: RABBITMQ_USE_LONGNAME
              value: "true"
            - name: RABBITMQ_NODENAME
              value: "rabbit@$(MY_POD_IP)"
            - name: K8S_SERVICE_NAME
              value: "rabbitmq"
            - name: RABBITMQ_ERLANG_COOKIE
              value: "cookie1"
          lifecycle:
            postStart:
              exec:
                command:
                  - /tmp/bootstrap.sh
      volumes:
        - name: config-volume
          configMap:
            name: rabbitmq-config
            items:
              - key: rabbitmq.conf
                path: rabbitmq.conf
              - key: enabled_plugins
                path: enabled_plugins
        - name: bootstrap-vol
          configMap:
            defaultMode: 0700
            name: rabbitmq-bootstrap
About this issue
- State: open
- Created 5 years ago
- Reactions: 14
- Comments: 21
Same with rancher/rancher v2.2.3 when scaling "rabbitmq-ha" (from helm) down and back up, as described by @bentastic27.
The issue is that when the service is scaled back up, the RABBITMQ_NODENAME environment variable seems to be rendered before MY_POD_NAME, and therefore RABBITMQ_NODENAME cannot rely on MY_POD_NAME, which is undefined at that point. (Kubernetes only expands a $(VAR) reference against variables defined earlier in the same env list; an unresolved reference is passed through literally, and the rabbitmq-server startup script then fails to eval it.)
Below is the interesting part of my StatefulSet to help understand the issue.
Erroneous config file as modified by Rancher: RABBITMQ_NODENAME depends on MY_POD_NAME, so it must be defined after MY_POD_NAME.
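As a minimal sketch of that broken ordering (illustrative only, not the commenter's exact snippet; the value is simplified, and only the variable names come from this issue):
env:
  # Rancher places the explicitly set variable first; at this point
  # MY_POD_NAME is not yet defined, so $(MY_POD_NAME) is not expanded
  # and reaches the container as a literal string.
  - name: RABBITMQ_NODENAME
    value: "rabbit@$(MY_POD_NAME)"
  # The fieldRef-injected variable ends up after the variable that
  # depends on it.
  - name: MY_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name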
Modified configuration file that works:
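Again as an illustrative sketch rather than the exact file, the fix is simply to define the injected variable first:
env:
  # Define the fieldRef-injected variable first...
  - name: MY_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  # ...so Kubernetes can expand $(MY_POD_NAME) when rendering this value.
  - name: RABBITMQ_NODENAME
    value: "rabbit@$(MY_POD_NAME)"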
This behavior is not limited to rabbitmq or apps deployed via Helm charts. We're facing the same when working with StatefulSet, Deployment, etc. Issue observed on Rancher 2.4.5. I believe it is related to how Rancher handles YAML specification rendering before deploying it to Kubernetes. From what I can see by testing, the variables set in the "Environment Variables" section will be placed above those set in the "Inject values from another resource" section in the final specification. To illustrate, here's an example screenshot.
How to reproduce
And here is the resulting YAML file of that same deployment:
And here is the resulting environment inside the running container, not exactly what we wanted, right?
Workaround for our case
Currently we can simply deploy via YAML using kubectl. Here's a simple deployment spec, resulting in the same deployment as above via the Rancher UI. Note that the order in which the variables are defined here is reversed from what Rancher has produced:
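This is not the commenter's actual spec; as a rough sketch under assumed names (dependent-env-demo, DEMO_POD_IP and DEMO_NODE_NAME are made up for illustration), a deployment like the following, applied directly with kubectl apply -f, keeps the declared order so the dependent variable resolves:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dependent-env-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dependent-env-demo
  template:
    metadata:
      labels:
        app: dependent-env-demo
    spec:
      containers:
        - name: demo
          image: busybox:1.36
          command: ["sh", "-c", "env && sleep 3600"]
          env:
            # Injected from the pod's own status; defined first so the
            # reference below can be resolved.
            - name: DEMO_POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            # Depends on DEMO_POD_IP, which is already defined above.
            - name: DEMO_NODE_NAME
              value: "demo@$(DEMO_POD_IP)"
Afterwards, kubectl get deployment dependent-env-demo -o yaml should show DEMO_POD_IP listed ahead of DEMO_NODE_NAME, which is what allows the $(DEMO_POD_IP) substitution to happen.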
This time the result in the running container is what we expect it to be.
The problem with this approach is that whenever someone edits the deployment via the Rancher UI or uses the Redeploy functionality, the specification is re-rendered in Rancher's current style, resulting in a broken deployment.
Expected behavior
We would certainly expect to be able to use dependent variables as described in the Kubernetes documentation without fear of our deployments getting broken by someone editing them in the Rancher UI or redeploying. I think a very simple solution that should work is to change the order in which Rancher places variable declarations in the specification: always place explicitly defined variables after those injected from other sources (field refs, resource refs, etc.).
I am experiencing the same issue, and it's been particularly frustrating since no matter how many times we attempt to update the file to have the properties in the correct order, it just immediately reverts to the existing file. What we end up with is pods that stay broken, and we can't scale back up because the first pod is borked, requiring us to completely remove and recreate the cluster in order to get it working again. In fairness, we are running it as a multi-cluster app, but I am not sure how that would make a difference. Any insight would be appreciated!
⚠️ Update: this has gotten so out of control. Now it seems that deleting any pod will prevent any new pods from ever being created. It seems the only way to “resolve” this issue is to completely nuke the application from the cluster (remove target from multi-cluster application) and then recreate it from scratch (add target back to multi-cluster application).
Here is the current scenario I just tried:
- rabbitmq-ha-p-ndvdf-0 - Failed due to previously deleting the pod in hopes it would come back
- rabbitmq-ha-p-ndvdf-1 - Healthy
- rabbitmq-ha-p-ndvdf-2 - Healthy
- rabbitmq-ha-p-ndvdf-3 - Healthy
- rabbitmq-ha-p-ndvdf-4 - Healthy
Instead of deleting pod 0 for some reason, I deleted rabbitmq-ha-p-ndvdf-4, and it will not create pod 4 again since pod 0 is in a crash/reboot loop due to the error this issue is about. I guess I will attempt to remove the project from the multi-cluster app and then re-add it so it creates it from scratch. Once that's done I'll try deleting pod 3 to see if it's all pods that are affected or just 0 for some reason.
⚠️ BIG UPDATE 2: OMG! This appears to be a zero-index issue!!! I did what I said I would do above, deleted pod 4 instead of pod 0, and it worked on the first try in less than 30 seconds! Screenshots incoming:
⚠️ Update 3: Nope, failure. I took it one step further and removed pods 1-4 (still didn’t touch pod 0) and now all the pods I deleted are failing.
How is it even conceivably possible that this chart is in wide use?!? This is whatever the opposite of high availability is.
I am now forced to copy the chart into our own private repo and modify it until it hopefully works. I’ll probably break even more than is already broken, but clearly this chart won’t get fixed anytime soon.
I use Rancher v2.5.8 and I have the same problem.
Oh, this github issue. I keep hoping never to see it again and yet here we are. Totally surreal.
Is the rabbitmq-ha chart abandoned or something? How is it possible that this is still a thing?
@rmalchow I tested it on 2.4.5, there is still a problem.