istio: Upgrading control plane from 1.2.2 to 1.2.5 causes downtime

Bug description

When I upgrade or downgrade the control plane between versions 1.2.2 and 1.2.5, my applications that use the sidecar go into an unready state and I see downtime in my services. My requests follow this path:

Load generator (outside cluster) -> Load Balancer (outside cluster) -> Istio Ingressgateway (inside cluster) -> Application (just a simple nginx docker image)

I have about 20 instances of istio-ingressgateway and 60 instances of nginx, and I generate a load of about 15k rps, which this setup normally handles without a sweat.

When I run netstat -ltpn inside the sidecar proxy, I see a new envoy process come up and the old one go away. This probably causes the application to become unhealthy, because the new envoy process isn't listening on port 15090 yet. After a while it does start listening on 15090 and 15001, and the errors go away once all instances are back.
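To make the netstat observation above easier to repeat, here is a minimal sketch of a helper that reads `netstat -ltn` style output on stdin and prints the ports that have no LISTEN socket. The function name `missing_listeners` is hypothetical, and the sketch assumes the usual Linux netstat column layout (local address in column 4, state in column 6).

```shell
# Print each expected port that has no LISTEN socket in the
# netstat output read from stdin (one missing port per line).
# Assumes Linux netstat columns: $4 = local address, $6 = state.
missing_listeners() {
  awk -v ports="$*" '
    BEGIN { n = split(ports, want, " ") }
    # Keep only listening sockets and reduce "addr:port" to "port".
    $6 == "LISTEN" { sub(/.*:/, "", $4); seen[$4] = 1 }
    END {
      for (i = 1; i <= n; i++)
        if (!(want[i] in seen)) print want[i]
    }'
}
```

It could be fed from the sidecar with something like `kubectl exec <pod> -c istio-proxy -- netstat -ltn | missing_listeners 15001 15090`, where `<pod>` is a placeholder. Empty output would mean both ports are listening.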

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure
[ ] Docs
[X] Installation
[X] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[X] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior

No effect on my traffic when doing a control plane upgrade of Istio.

Steps to reproduce the bug

We consider istio-ingressgateway to also be part of the data plane and don't want to make any changes to it, so we upgrade everything except the gateway. CNI is running version 1.2.5. I change versions using these commands:

helm template install/kubernetes/helm/istio-init --name istio-init --namespace istio-system | kubectl apply -f -

mkdir tmp
mv install/kubernetes/helm/istio/charts/gateways/templates/* tmp/
helm template install/kubernetes/helm/istio/ --namespace istio-system --name istio --values custom.yaml | kubectl -n istio-system apply -f -
mv tmp/* install/kubernetes/helm/istio/charts/gateways/templates/
rm -r tmp/

This temporarily removes the gateway templates from the chart, so that everything except the gateways is upgraded.
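One fragile point in the move-aside steps is that the gateway templates stay in tmp/ if the render or apply step fails partway. The following is a minimal sketch of a wrapper that moves a directory's files aside, runs a command, and moves the files back whether or not the command succeeded. The name `with_dir_stashed` is hypothetical and not part of Helm or Istio, and the sketch does not handle interruption by signals.

```shell
# Run a command while the (non-hidden) files of a directory are
# temporarily moved aside; restore them afterwards and return the
# command's exit status.
with_dir_stashed() {
  dir=$1; shift
  stash=$(mktemp -d) || return 1
  # Move the files out; give up cleanly if there is nothing to move.
  mv "$dir"/* "$stash"/ || { rmdir "$stash"; return 1; }
  "$@"
  status=$?
  # Restore the files even if the command failed.
  mv "$stash"/* "$dir"/ && rmdir "$stash"
  return $status
}
```

With this, the render-and-apply pipeline could be passed as the command, for example `with_dir_stashed install/kubernetes/helm/istio/charts/gateways/templates sh -c '<helm template ... | kubectl apply ...>'`. A plain pipeline reports only the exit status of its last command, so a failing helm step would not be reflected in the status unless the shell in use provides something like `pipefail`.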

Version (include the output of istioctl version --remote and kubectl version)

Istio - 1.2.2 to 1.2.5
Kubernetes - 1.15.0

How was Istio installed? Using helm template and this custom.yaml for values -

gateways:
  istio-ingressgateway:
    type: NodePort
    autoscaleMin: 20
    autoscaleMax: 20
    ports:
    - port: 80
      targetPort: 80
      name: http2
      nodePort: 60000
    - port: 443
      name: https
      nodePort: 60001
    - port: 31400
      name: tcp
      nodePort: 61400
    resources:
      requests:
        cpu: 2
        memory: 512Mi
      limits:
        cpu: 2
        memory: 512Mi
kiali:
  enabled: true
  dashboard:
    grafanaURL: http://grafana:3000
    jaegerURL: http://tracing:80
  resources:
    requests:
      cpu: 4
      memory: 4096Mi
    limits:
      cpu: 4
      memory: 4096Mi
  createDemoSecret: true
  prometheusAddr: prometheus.internal.com

mixer:
  policy:
    enabled: false
  telemetry:
    autoscaleMin: 30
    autoscaleMax: 100

grafana:
  enabled: true

pilot:
  traceSampling: 100.0
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 1
      memory: 2Gi

tracing:
  enabled: false

istio_cni:
  enabled: true

global:
  policyCheckFailOpen: true
  proxy:
    logLevel: "error"
    resources:
      requests:
        cpu: 50m
        memory: 180Mi
      limits:
        cpu: 2
        memory: 512Mi
  defaultResources:
    requests:
      cpu: 1
      memory: 2048Mi
    limits:
      cpu: 2
      memory: 2048Mi

Environment where bug was observed (cloud vendor, OS, etc)

On-prem k8s cluster running on bare metal

Additionally, please consider attaching a cluster state archive by attaching the dump file to this issue.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 19 (15 by maintainers)

Most upvoted comments

Confirmed this fixed the ACK ERRORS about certs not found on upgrades as well