thanos: thanos+ingress-nginx+grpc: impossible setup due to missing host header

Thanos, Prometheus and Golang version used: quay.io/thanos/thanos:v0.7.0

What happened: I set up two Kubernetes clusters. Thanos Query runs in one cluster (along with a local Prometheus + sidecar) and needs to query the Thanos Sidecar in the remote cluster; everything runs in AWS (but not on EKS). I created an ingress-nginx setup with gRPC support, using this config:

---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: monitoring-ingress
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
  - host: prometheus-k8s-live-a.ops.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-k8s-live-a
          servicePort: 9090
  - host: prometheus-k8s-live-b.ops.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-k8s-live-b
          servicePort: 9090
  tls:
  - hosts:
    - prometheus-k8s-live-a.ops.example.com
    - prometheus-k8s-live-b.ops.example.com
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
  name: grpc-ingress
  namespace: monitoring
spec:
  rules:
  - host: sidecar-k8s-live-a.ops.example.com
    http:
      paths:
      - backend:
          serviceName: sidecar-k8s-live-a
          servicePort: 10911
  - host: sidecar-k8s-live-b.ops.example.com
    http:
      paths:
      - backend:
          serviceName: sidecar-k8s-live-b
          servicePort: 10911
  tls:
  - hosts:
      - sidecar-k8s-live-a.ops.example.com
      - sidecar-k8s-live-b.ops.example.com

Thanos Query is using:

--store=sidecar-k8s-live-a.ops.example.com.:443
--store=sidecar-k8s-live-b.ops.example.com.:443

I can connect to the Prometheus URL, but the sidecar gRPC connection fails in Thanos Query. Looking at the nginx logs I can see the query arriving over HTTP/2 but returning 400. A curl gets a 503, but probably only because it is not really gRPC. After changing the ingress-nginx log format to show the host header, I can see that curl sends the correct host header, but for Thanos Query the logs show only _, so it is sending either an empty header or a literal _.

What you expected to happen: I wanted to share the ingress so that it receives both the HTTPS requests for Prometheus and the gRPC traffic, using the host header to route each request to the correct service. Sadly, Thanos Query fails to send the host header, so nginx cannot apply the virtual host lookup and serves the request from the default site.

Full logs of relevant components

Logs

172.27.119.135 - [172.27.119.135] - - [10/Sep/2019:15:02:40 +0000] "PRI * HTTP/2.0" 400 163 "-" "-" 0 0.001 [] [] - - - - 477873c7a336618ccf06cf9c03fe8d97
172.27.119.135 - [172.27.119.135] - - [10/Sep/2019:15:02:40 +0000] "PRI * HTTP/2.0" 400 163 "-" "-" 0 0.003 [] [] - - - - c32e68975e91159a64326b55d4b72934
2019/09/10 15:02:40 [error] 1137#1137: *7155 upstream rejected request with error 2 while reading response header from upstream, client: 172.26.81.74, server: sidecar-k8s-live-a.ops.example.com, request: "PRI / HTTP/1.1", upstream: "grpc://100.96.136.200:10911", host: "sidecar-k8s-live-a.ops.example.com"
172.26.81.74 - [172.26.81.74] - - [10/Sep/2019:15:02:40 +0000] "PRI / HTTP/1.1" 502 163 "-" "curl/7.58.0" 189 0.002 [monitoring-sidecar-k8s-live-a-10911] [] 100.96.136.200:10911 0 0.004 502 4e08c4e8c6d8df148c5bc3a68d61ccf9

Here we can see that the Thanos Query requests do not trigger the virtual host, while the curl request, which includes the host header, is routed to the Thanos Sidecar.
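
For reference, the host header can be surfaced in the controller's access log by customizing the log format. A minimal sketch of such a change, assuming the controller's ConfigMap is called nginx-configuration in the ingress-nginx namespace (both names are placeholders for whatever your installation uses):

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration   # placeholder; use the ConfigMap your controller was started with
  namespace: ingress-nginx
data:
  # append $host so the virtual-host lookup result is visible in the access log
  log-format-upstream: '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent host=$host upstream=$proxy_upstream_name'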

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 11
  • Comments: 58 (8 by maintainers)

Most upvoted comments

Another workaround with the NGINX Ingress Controller is to use the --grpc-client-server-name flag on your thanos-query. This uses Server Name Indication (SNI), allowing the ingress controller to route the request correctly.

I believe this limits each querier to one server name only. Therefore you will need multiple queriers if you have multiple clusters to communicate between.

Your thanos-query args would include:

--grpc-client-server-name=sidecar-k8s-live.ops.example.com
--grpc-client-tls-secure
--store=dns+sidecar-k8s-live.ops.example.com:443

And your ingress annotations would include:

nginx.ingress.kubernetes.io/backend-protocol: GRPC
nginx.ingress.kubernetes.io/ssl-redirect: "true"
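
If you do have several clusters to reach, a rough sketch of the per-cluster split described above, reusing the hostnames from this issue (one querier per remote sidecar, each pinned to a single SNI):

# querier for cluster a
--grpc-client-tls-secure
--grpc-client-server-name=sidecar-k8s-live-a.ops.example.com
--store=dns+sidecar-k8s-live-a.ops.example.com:443

# querier for cluster b
--grpc-client-tls-secure
--grpc-client-server-name=sidecar-k8s-live-b.ops.example.com
--store=dns+sidecar-k8s-live-b.ops.example.com:443

A top-level querier can then list those two queriers as its own stores, since Thanos Query exposes the Store API over gRPC.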

Apparently this is still needed and valid.

After a while of pulling my hair out over this one, I managed to make it work. Just a note here: my ingress is on the Query instance, not the sidecar; I would assume it'd work the same way for the sidecar (I didn't test that part).

My architecture is as follows:

Query (central cluster) -> Query (remote cluster :: ingress on this one) -> Sidecar (remote cluster) 
                        -> Sidecar (central cluster)

I’m deploying the stack with Helm; here is my config.

Remote & Central Cluster Prometheus Operator

prometheus:
  prometheusSpec:
    thanos:
      image: docker.io/bitnami/thanos
      tag: 0.17.2-scratch-r2
      objectStorageConfig:
        name: thanos
        key: objstore.yml

Remote Cluster Query Config

existingObjstoreSecret: objstorage
clusterDomain: cluster.local
query:
  dnsDiscovery:
    enabled: false
  stores:
    - kube-prometheus-prometheus-thanos.monitoring:10901 ## <-- thanos-sidecar

  ingress:
    enabled: false # disabled for http
    grpc:
      enabled: true
      annotations:
        kubernetes.io/ingress.class: nginx-internal
        nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
        ingress.kubernetes.io/ssl-redirect: "true"

      hostname: thanos.query.domain.local
      extraTls:
        - hosts:
            - thanos.query.domain.local
          secretName: thanos-grpc-tls

Central Cluster Query Config

existingObjstoreSecret: objstorage
clusterDomain: cluster.local
query:
  dnsDiscovery:
    enabled: false
  stores:
    ## this setup requires the thanos-sidecar tls to be 
    ## enabled. If you don't want to enable thanos-sidecar tls, you can modify the central cluster config by
    ## 1. create two query instances in the central cluster
    ## 2. first query instance has tls enabled on the client and store urls should only be the remote clusters' 
    ## 3. second query instance will point to the first query by service name, and to the local thanos-sidecar 
    - kube-prometheus-prometheus-thanos.monitoring:10901 
    - thanos.query.domain.local:443
  grpcTLS:
    client:
      secure: true
      existingSecret:
        name: thanos-grpc-tls
        keyMapping:
          ca-cert: ca.crt
          tls-cert: tls.crt
          tls-key: tls.key

Notice that the certificate used for the Query ingress and for client TLS is the same certificate. I hope this helps someone.
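
For completeness, one way to create that shared secret by hand, assuming the CA, certificate and key files already exist locally and the namespace is monitoring (in the setup above the secret could equally be produced by cert-manager):

kubectl -n monitoring create secret generic thanos-grpc-tls \
  --from-file=ca.crt=./ca.crt \
  --from-file=tls.crt=./tls.crt \
  --from-file=tls.key=./tls.key

The keyMapping block in the values above then maps the chart's ca-cert/tls-cert/tls-key entries to those ca.crt/tls.crt/tls.key secret keys.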

Still valid and help wanted.

I have a similar issue. I have a multi-cluster setup where each cluster has one Query and one Sidecar, plus an observability cluster with a Query instance that points to all Query instances in the other clusters. The Query instances are exposed through an ingress whose backend is the gRPC port.

The Query and Sidecar inside each cluster are working, but I can’t create stores pointing to query.my-local-domain.local:443; I keep getting this error: rpc error: code = DeadlineExceeded desc = latest balancer error: connection closed

Here is my ingress with its annotations:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: thanos-ing-test
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: nginx-internal
    cert-manager.io/cluster-issuer: selfsigned
    ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
    nginx.ingress.kubernetes.io/grpc-backend: "true"
    # nginx.ingress.kubernetes.io/x-forwarded-prefix: "true"
    # nginx.ingress.kubernetes.io/auth-tls-verify-client: "on"
    # nginx.ingress.kubernetes.io/auth-tls-secret: monitoring/thanos-grpc-tls
    # nginx.ingress.kubernetes.io/auth-tls-verify-depth: "1"
    # nginx.ingress.kubernetes.io/force-ssl-redirect: 'true'
    # nginx.ingress.kubernetes.io/protocol: h2c
    # nginx.ingress.kubernetes.io/proxy-read-timeout: '160'
    external-dns.alpha.kubernetes.io/hostname: query.my-local-domain.local
    external-dns.alpha.kubernetes.io/target: nginx.my-local-domain.local
spec:
  rules:
  - host: query.my-local-domain.local
    http:
      paths:
      - backend:
          serviceName: thanos-query
          servicePort: 10901
  tls:
  - hosts:
    - query.my-local-domain.local
    secretName: thanos-query-ingress-tls

The ingress controller is exposed through an Azure Internal Load Balancer. Am I missing anything?

P.S.: TLS is not enabled on either the sidecar or the query instances.

I had the same problem

Hello, for me it works when you add the extra flag --grpc-client-tls-secure; on the observed cluster I have cert-manager activated.

For anyone who’s bashing their head against this, this single line fixed it; we have ingress enabled in both the observer and the remote cluster.

Using Bitnami kube-prometheus and Bitnami thanos on EKS 1.21.

Here are the values for thanos:

"bucketweb":
  "enabled": true
"compactor":
  "enabled": true
"minio":
  "auth":
    "rootPassword": "password"
    "rootUser": "user"
  "defaultBuckets": "thanos"
  "enabled": true
"objstoreConfig": |
  "config":
    "access_key": "user"
    "bucket": "thanos"
    "endpoint":minio.thanos-grafana.svc.cluster.local:9000
    "insecure": true
    "secret_key": "password"
  "type": "s3"
"query":
  "stores":
  - "thanos.cool-1.foobar.io:443"
  - "thanos.cool-2.foobar.io:443"
  "extraFlags":
  - "--grpc-client-tls-secure"
  # - "--grpc-client-server-name=kube-prometheus-prometheus-thanos"
  "dnsDiscovery":
    "enabled": false
  "ingress":
    "grpc": 
      "enabled": true
      "hostname": "thanos-querier.foobar.io"
      "tls": true"
      "annotations":
        "cert-manager.io/cluster-issuer": "letsencrypt-prod"
        "kubernetes.io/ingress.class": "nginx"
        "nginx.ingress.kubernetes.io/backend-protocol": "GRPC"
        "nginx.ingress.kubernetes.io/ssl-redirect": "true"
        "nginx.ingress.kubernetes.io/grpc-backend": "true"
"ruler":
  "alertmanagers":
  - "http://prometheus-operator-alertmanager.thanos-grafana.svc.cluster.local:9093"
  "config": |
    "groups":
    - "name": "metamonitoring"
      "rules":
      - "alert": "PrometheusDown"
        "expr": absent(up{prometheus="thanos-grafana/prometheus-operator"})
  "enabled": true
"storagegateway":
  "enabled": true

and for kube-prometheus:

"prometheus":
  "externalLabels":
    "cluster": "foobar"
  "thanos":
    "create": true
    "ingress":
      "annotations":
        "cert-manager.io/cluster-issuer": "letsencrypt-prod"
        "kubernetes.io/ingress.class": "nginx"
        "nginx.ingress.kubernetes.io/backend-protocol": "GRPC"
        "nginx.ingress.kubernetes.io/force-ssl-redirect": "true"
        "nginx.ingress.kubernetes.io/grpc-backend": "true"
        "nginx.ingress.kubernetes.io/protocol": "h2c"
        "nginx.ingress.kubernetes.io/proxy-read-timeout": "160"
      "enabled": true
      "hosts":
      - "name": "thanos.cool-1.foobar.io"
      "tls":
        - "hosts":
          - "thanos.cool-1.foobar.io"
          "secretName": "foobar-thanos-tls-secret"

I had the same problem. My solution was to use the Bitnami charts.

Depends on: https://github.com/bitnami/charts/pull/5345 and https://github.com/bitnami/charts/pull/5344

my bitnami/kube-prometheus custom values:

prometheus:
  disableCompaction: true
  thanos:
    create: true
    objectStorageConfig:
      secretName: thanos-objstore-config
      secretKey: objstore.yml
    ingress:
      enabled: true
      annotations:
        kubernetes.io/ingress.class: nginx
        nginx.ingress.kubernetes.io/ssl-redirect: "true"
        nginx.ingress.kubernetes.io/auth-tls-verify-client: "on"
        nginx.ingress.kubernetes.io/auth-tls-secret: monitoring/thanos-certs
        nginx.ingress.kubernetes.io/backend-protocol: GRPC

my bitnami/thanos custom values:

existingObjstoreSecret: thanos-objstore-config
query:
  hostAliases:
  - ip: "111.11.111.1"
    hostnames:
    - thanos.earth.cluster
  - ip: "111.11.112.1"
    hostnames:
    - thanos.mars.cluster
  stores:
  - thanos.earth.cluster:443
  - thanos.mars.cluster:443
  - thanos-storegateway.default.svc.cluster.local:10901
  dnsDiscovery:
    enabled: false
  grpcTLS:
    client:
      secure: true
      cert: |-
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----
      key: |-
        -----BEGIN PRIVATE KEY-----
        ...
        -----END PRIVATE KEY-----
      ca: |-
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----
compactor:
  enabled: true
storegateway:
  enabled: true
  grpc:
    tls:
      enabled: true
      cert: |-
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----
      key: |-
        -----BEGIN PRIVATE KEY-----
        ...
        -----END PRIVATE KEY-----
      ca: |-
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----


Possible fix pushed that uses a flag to change behaviour, based on the workaround detailed by @cjf-fuller in https://github.com/thanos-io/thanos/issues/1507#issuecomment-580820712.

If the “grpc-client-dns-server-name” flag is specified, the DNS provider returns the name that was originally looked up, and the relevant gRPC dial options are added at connection time. This allows a different SNI per store, based on the originally provided (dns+<name>:<port>) name.
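
Assuming the flag is a plain boolean switch (the exact syntax is not shown in this thread, so this is only a hypothetical sketch), the querier invocation described above might look roughly like this, with the SNI derived per store from the dns+ name:

--grpc-client-tls-secure
--grpc-client-dns-server-name                         # hypothetical form of the proposed flag
--store=dns+sidecar-k8s-live-a.ops.example.com:443    # SNI: sidecar-k8s-live-a.ops.example.com
--store=dns+sidecar-k8s-live-b.ops.example.com:443    # SNI: sidecar-k8s-live-b.ops.example.com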

A reasonable way to work around this with NGINX Ingress Controller is to use the tcp-services-configmap feature to expose ports that route directly to sidecar-k8s-live-a:10911 (e.g. 11911) and sidecar-k8s-live-b:10911 (e.g. 12911) respectively.
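
A minimal sketch of that ConfigMap, assuming the controller lives in the ingress-nginx namespace, was started with --tcp-services-configmap=ingress-nginx/tcp-services, and the sidecar Services sit in the monitoring namespace (all of these names are assumptions); the extra ports also have to be exposed on the controller's Service:

apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx
data:
  # external port -> namespace/service:port
  "11911": "monitoring/sidecar-k8s-live-a:10911"
  "12911": "monitoring/sidecar-k8s-live-b:10911"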

Then your thanos-query options would look something like:

--store=sidecar-k8s-live.ops.example.com:11911  # routes to sidecar-k8s-live-a:10911
--store=sidecar-k8s-live.ops.example.com:12911  # routes to sidecar-k8s-live-b:10911

You still have to set up TLS on your own in both thanos-query and thanos-sidecar, but it helps avoid all the HTTP routing that the ingress controller tries to do for you.


@j3p0uk I have tried this with grpcurl from the central cluster: grpcurl -insecure query.my-local-domain.local:443 list

I’m getting this response:

grpc.health.v1.Health
grpc.reflection.v1alpha.ServerReflection
thanos.Rules
thanos.Store

I did a describe as well, grpcurl -insecure query.my-local-domain.local:443 describe, and this is the output:

grpc.health.v1.Health is a service:
service Health {
  rpc Check ( .grpc.health.v1.HealthCheckRequest ) returns ( .grpc.health.v1.HealthCheckResponse );
  rpc Watch ( .grpc.health.v1.HealthCheckRequest ) returns ( stream .grpc.health.v1.HealthCheckResponse );
}
grpc.reflection.v1alpha.ServerReflection is a service:
service ServerReflection {
  rpc ServerReflectionInfo ( stream .grpc.reflection.v1alpha.ServerReflectionRequest ) returns ( stream .grpc.reflection.v1alpha.ServerReflectionResponse );
}
Failed to resolve symbol "thanos.Rules": Symbol not found: thanos.Rules

Then grpcurl -insecure query.my-local-domain.local:443 thanos.Store/Info, and this is the output:

Error invoking method "thanos.Store/Info": target server does not expose service "thanos.Store"

What if I don’t want to use TLS at all? Not for internal communication (which is the default), and not for external queries. Is it possible?

@Than0s-coder, great point; we have set up a “central” Querier to target a “leaf” Querier rather than the sidecars directly. But it sounds like the risk of overwriting the initial query and losing the host headers would still be present?

@martip07, I am still very much a beginner with Thanos, so I could be totally wrong here. But as far as I can tell, the --grpc-client-server-name argument is a string that sets ServerName in tls.Config. I am not sure how I would make this a list of server names.

I have seen that the TLS extensions documentation mentions a ServerNameList struct, but I cannot find many examples of it being used. I have tested this with a simple comma-separated list (--grpc-client-server-name=test-1.myorg.com,test-2.myorg.com), which fails at the TLS handshake because the list is never enumerated: the wildcard certificate is valid for “*.myorg.com”, not for “test-1.myorg.com,test-2.myorg.com”.

I found a reference to this problem in an issue from several months ago (not directly related to this one), https://github.com/thanos-io/thanos/issues/977#issuecomment-483679010, and it basically confirms the problem.