redis-operator: Different RedisFailover's sentinels join together

Environment

How are the pieces configured?

  • Redis Operator version v1.2.2
  • Kubernetes version v1.23.13
  • Kubernetes configuration used (e.g.: Is RBAC active?)
affinity: {}
annotations: {}
container:
  port: 9710
fullnameOverride: ""
image:
  pullPolicy: IfNotPresent
  repository: quay.io/spotahome/redis-operator
  tag: v1.2.2
imageCredentials:
  create: false
  email: someone@example.com
  existsSecrets:
  - registrysecret
  password: somepassword
  registry: url.private.registry
  username: someone
monitoring:
  enabled: false
  prometheus:
    name: unknown
  serviceAnnotations: {}
  serviceMonitor: false
nameOverride: ""
nodeSelector: {}
replicas: 3
resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 128Mi
securityContext:
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000
service:
  port: 9710
  type: ClusterIP
serviceAccount:
  annotations: {}
  create: true
  name: ""
tolerations: []
updateStrategy:
  type: RollingUpdate

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 48 (31 by maintainers)

Most upvoted comments

Hi! As I was thinking about using this operator I came across this issue, and it immediately reminded me of the problems I had getting this “fixed” for Redis Sentinel clusters running on VMs in Google Cloud. We have around 20 of these running, and from time to time you of course want to upgrade or install OS updates. Since we do immutable infrastructure we don’t update VMs in place (via Ansible, Chef, Puppet, …); we completely replace them.

All our Redis clusters have three instances. So when we update such a cluster we shut down one Redis node (each node always runs one Redis and one Sentinel process). We build our OS images with HashiCorp Packer, so our automation picks up the latest OS image, starts a new VM, and re-joins it to the cluster. The same then happens with the two remaining nodes: we first replace the two replicas and finally the primary. Before shutting down the primary we trigger a failover.

As long as you do this with only one Redis Sentinel cluster it normally works fine. But since our update process runs in parallel, all 20 Redis Sentinel clusters are recreated at the same time. That also works just fine, but in the beginning we discovered that some nodes suddenly tried to join other clusters during this process. All clusters have a different redisMasterName configured, so the join of course failed. To move such a node back to the cluster it belonged to, we had to manually clean its configuration and rejoin it.

We tried a lot of workarounds, but nothing worked reliably because Redis/Sentinel always “remembered” the IP addresses of the old (now gone) nodes. And that is actually the problem: while the Redis Sentinel clusters were being recreated, it was likely that a node got an internal IP address that had previously been used by a node belonging to a different Redis Sentinel cluster.

So our “solution” was to give every VM a fixed internal IP address: the address is allocated once and from then on always assigned to the VM that used it before, so it never changes hands. That “fixed” this issue once and for all 😃

But AFAIK you can’t do this in Kubernetes. From what I’ve read so far in this thread, using a NetworkPolicy or different ports for every Redis Sentinel deployment might be possible workarounds (a sketch of the NetworkPolicy variant follows below). Since Redis 6.2 you can also use hostnames instead of IP addresses: https://redis.io/docs/management/sentinel/#ip-addresses-and-dns-names Since Kubernetes has DNS-based service discovery, this might be a more general solution to this problem.
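To make the NetworkPolicy idea concrete, here is a minimal sketch that only lets pods of the same RedisFailover (plus the operator) reach one deployment’s sentinel port, so a sentinel from another cluster that inherits a recycled pod IP can no longer gossip its way in. The RedisFailover name (rf-a), the label keys/values, and the operator selector are assumptions for illustration; check the labels the operator actually sets on your pods and adjust accordingly.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: rf-a-sentinel-isolation
spec:
  # Select only the sentinel pods of the "rf-a" RedisFailover (labels assumed).
  podSelector:
    matchLabels:
      app.kubernetes.io/component: sentinel
      app.kubernetes.io/name: rf-a
  policyTypes:
    - Ingress
  ingress:
    # Allow the sentinel port only from pods of the same RedisFailover and
    # from the operator. If applications query Sentinel directly to discover
    # the master, add a selector for them too; if the operator runs in a
    # different namespace, combine with a namespaceSelector.
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: rf-a
        - podSelector:
            matchLabels:
              app: redis-operator   # assumed operator pod label
      ports:
        - protocol: TCP
          port: 26379

With one such policy per RedisFailover the IP-reuse race can still happen, but the stray connection is dropped at the network layer instead of poisoning the sentinel state. A similar policy on the Redis port (6379) may also be needed, since sentinels also reconnect to remembered replica addresses.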

Thanks for sharing the details @EladDolev, will take some time to test and get back with code changes, will update here…

@tparsa I can confirm that it helps. Thanks for the tip.

@zekena2 Deleting all sentinel pods will also fix the problem.

What I saw in the operator metrics indicates that the operator does realize there is a problem with the number of sentinels. But the only fix the operator applies is sending a SENTINEL RESET *, which doesn’t fix anything.