kiali: Kiali causing 503 and 504 errors at the EKS apiserver.

Describe the bug

After installing Kiali, multiple 504 errors are shown at seemingly random intervals for LIST verbs on several Istio objects (gateways, virtualservices, peerauthentications, destinationrules, etc.).

Besides triggering alerts in our monitoring system (which can be tuned), this problem also seems to cause authentication problems with EKS. Perhaps it is overloading the apiserver?

I was able to reproduce the scenario on two different clusters, in different AWS accounts.

Additional information: our clusters have 70+ namespaces but, as of now, Istio is enabled in only a handful (fewer than 10) of them. However, the Kiali operator set the label “kiali.io/member-of” on all 70+ namespaces.

Versions used

  • Kiali: v1.33.1 (179cd6b016cd15deac16266520bb406185508b74)
  • Istio: 1.8.5
  • Kubernetes flavour and version: v1.16.15-eks-ad4801

To Reproduce

Steps to reproduce the behavior:

  1. Install prometheus+grafana and enable cluster monitoring
  2. Install istio
  3. Install kiali
  4. Go to prometheus or grafana endpoints and execute/add a panel with the following query:

sum by(resource, subresource, verb, code) (rate(apiserver_request_total{code=~"5..",job="apiserver"}[5m])) / sum by(resource, subresource, verb, code) (rate(apiserver_request_total{job="apiserver"}[5m])) > 0.05

  5. Let it run for a while and you’ll see multiple errors for the LIST verb
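
If it helps with step 4, here is a minimal sketch of wrapping that query in an alerting rule, assuming the Prometheus Operator’s PrometheusRule CRD is available (the rule name, namespace, and "for" duration are placeholders):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: apiserver-5xx-ratio   # hypothetical name
  namespace: monitoring       # assumed monitoring namespace
spec:
  groups:
    - name: apiserver.rules
      rules:
        - alert: APIServer5xxRatioHigh
          # Fires when more than 5% of apiserver requests for a given
          # resource/verb combination returned a 5xx code over 5 minutes.
          expr: |
            sum by(resource, subresource, verb, code) (rate(apiserver_request_total{code=~"5..",job="apiserver"}[5m]))
              / sum by(resource, subresource, verb, code) (rate(apiserver_request_total{job="apiserver"}[5m])) > 0.05
          for: 10m
          labels:
            severity: warning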

Screenshot from 2021-04-22 18-37-07

Expected behavior

Kiali shouldn’t generate errors when trying to list objects it needs to watch.

Extra information: Kiali CR

apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
  namespace: istio-system
  annotations:
    ansible.sdk.operatorframework.io/verbosity: "1"
spec:
  auth:
    strategy: "anonymous"
  deployment:
    view_only_mode: true
    ingress_enabled: false
  external_services:
    tracing:
      in_cluster_url: "http://tracing.istio-system/"
      url: "http://tracing.homolog.my.domain/"
      use_grpc: false

Note: gRPC is disabled because I was not able to find the correct endpoint (adding or removing /jaeger had no effect).
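
If anyone else hits this, a sketch of the gRPC variant I would try next, assuming jaeger-query’s gRPC port (16685) is exposed on the tracing service (unverified here, which is why use_grpc is off above):

external_services:
  tracing:
    # assumed jaeger-query gRPC port; adjust if the service exposes a different one
    in_cluster_url: "http://tracing.istio-system:16685/jaeger"
    use_grpc: true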

Please let me know if you need more information.

UPDATE:

With the values below, only the 503 errors still appear in our monitoring.

  deployment:
    # Limits to 22 namespaces
    accessible_namespaces:
    - ns-group1-.*
    - ns-group2-.*
    ingress_enabled: false
    logger:
      log_level: debug
    view_only_mode: true
  external_services:
    custom_dashboards:
      discovery_enabled: "false"
  kubernetes_config:
    burst: 50                            # k8s client-side burst limit
    cache_duration: 600
    cache_token_namespace_duration: 60
    qps: 10                              # k8s client-side queries per second

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 39 (10 by maintainers)

Most upvoted comments

I just found it a bit confusing that a specific operator version would default to the latest application version. I would expect it to deploy the application version matching the operator.

@sergiomacedo there is (was) a reason for that - but I’m going to switch it back. Read this issue I just created if you care about the gory details 😃

I’m using Helm to install the kiali-operator on my cluster. However, it seems the kiali-operator always tries to pull the latest image.

When falling back to a previous version of the operator, you can explicitly tell the operator to use a specific image via: https://github.com/kiali/kiali-operator/blob/master/deploy/kiali/kiali_cr.yaml#L250-L264

Not specifying this causes the operator to install the “lastrelease” as defined here: https://github.com/kiali/kiali-operator/blob/master/playbooks/default-supported-images.yml#L1
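
As an illustration, pinning the server image in the Kiali CR looks roughly like this (field names are taken from the linked kiali_cr.yaml; the tag value is just an example):

spec:
  deployment:
    image_name: "quay.io/kiali/kiali"   # default server image
    image_version: "v1.26.0"            # pin a specific release instead of the "lastrelease" default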

This brings up an interesting point that is completely unrelated to this issue, but one I need to start thinking about. We recently introduced a feature in which the operator will not allow you to set this image_version field (you are required to install the version the operator defaults to). That clearly isn’t going to work when you have an older operator and Kiali then releases a new version of the server that may require new or different permissions or CR settings. We may have to change that feature’s behavior. I’ll write a separate GitHub issue for this… I think this is going to be a problem.

Note: this isn’t a problem to worry about with v1.26. So set image_version and it will work.