kubeflow: ambassador pods keep crashing and cannot be created

Hi,

I used Ansible to set up my Kubernetes cluster from binaries, and followed the user guide to set up Kubeflow. But I found that the ambassador pods cannot be created.

The ambassador logs are attached below. Thanks.

root@master:~# kubectl get pod -n=kubeflow
NAME                              READY     STATUS             RESTARTS   AGE
ambassador-7987df44b9-962wh       1/2       CrashLoopBackOff   16         1h
ambassador-7987df44b9-nnf2w       1/2       CrashLoopBackOff   16         1h
ambassador-7987df44b9-p2zp9       1/2       CrashLoopBackOff   16         1h
tf-hub-0                          1/1       Running            0          1h
tf-job-operator-78757955b-gkv52   1/1       Running            0          1h

root@master:~# kubectl -n=kubeflow logs ambassador-7987df44b9-962wh ambassador

./entrypoint.sh: set: line 63: can't access tty; job control turned off
/usr/lib/python3.6/site-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
2018-03-05 18:47:53 kubewatch 0.26.0 INFO: Merging config inputs from /etc/ambassador-config
2018-03-05 18:47:54 kubewatch 0.26.0 INFO: update: including key k8s-dashboard-kubeflow.yaml
2018-03-05 18:47:54 kubewatch 0.26.0 INFO: Scheduling restart
2018-03-05 18:47:54 kubewatch 0.26.0 INFO: Changes detected, regenerating envoy config.
2018-03-05 18:47:54 kubewatch 0.26.0 INFO: Wrote k8s-dashboard-kubeflow.yaml to /etc/ambassador-config-1/k8s-dashboard-kubeflow.yaml
2018-03-05 18:47:54 kubewatch 0.26.0 INFO: generating config with gencount 1
2018-03-05 18:47:54 kubewatch 0.26.0 INFO: PROCESS: k8s-dashboard-kubeflow.yaml.1 => k8s-dashboard-kubeflow.yaml
2018-03-05 18:47:54 kubewatch 0.26.0 INFO: PROCESS: k8s-dashboard-kubeflow.yaml.1 => service k8s-dashboard, namespace kubeflow
2018-03-05 18:47:54 kubewatch 0.26.0 INFO: CLUSTER cluster_127_0_0_1_8877: new from --internal--
2018-03-05 18:47:54 kubewatch 0.26.0 INFO: CLUSTER cluster_127_0_0_1_8877: referenced by --internal--
2018-03-05 18:47:54 kubewatch 0.26.0 INFO: CLUSTER cluster_127_0_0_1_8877: referenced by --internal--
2018-03-05 18:47:54 kubewatch 0.26.0 INFO: CLUSTER cluster_kubernetes_dashboard_kube_system_otls: new from k8s-dashboard-kubeflow.yaml.1
2018-03-05 18:47:59 kubewatch 0.26.0 WARNING: Scout: could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f6135db0898>: Failed to establish a new connection: [Errno -3] Try again',))
2018-03-05 18:47:59 kubewatch 0.26.0 INFO: Scout reports {"latest_version": "0.26.0", "exception": "could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f6135db0898>: Failed to establish a new connection: [Errno -3] Try again',))", "cached": false, "timestamp": 1520275674.049535}
[2018-03-05 18:47:59.072][8][info][upstream] source/common/upstream/cluster_manager_impl.cc:132] cm init: all clusters initialized
[2018-03-05 18:47:59.072][8][info][config] source/server/configuration_impl.cc:55] loading 1 listener(s)
[2018-03-05 18:47:59.078][8][info][config] source/server/configuration_impl.cc:95] loading tracing configuration
[2018-03-05 18:47:59.078][8][info][config] source/server/configuration_impl.cc:122] loading stats sink configuration
2018-03-05 18:47:59 kubewatch 0.26.0 INFO: Configuration /etc/ambassador-config-1-envoy.json valid
2018-03-05 18:47:59 kubewatch 0.26.0 INFO: Moved valid configuration /etc/ambassador-config-1-envoy.json to /etc/envoy-1.json
AMBASSADOR: starting diagd
AMBASSADOR: starting Envoy
AMBASSADOR: waiting
PIDS: 9:diagd 10:envoy 11:kubewatch
[2018-03-05 18:47:59.227][12][info][main] source/server/server.cc:184] initializing epoch 0 (hot restart version=9.200.16384.127.options=capacity=16384, num_slots=8209 hash=228984379728933363)
[2018-03-05 18:47:59.508][12][info][config] source/server/configuration_impl.cc:55] loading 1 listener(s)
[2018-03-05 18:47:59.599][12][info][config] source/server/configuration_impl.cc:95] loading tracing configuration
[2018-03-05 18:47:59.599][12][info][config] source/server/configuration_impl.cc:122] loading stats sink configuration
[2018-03-05 18:47:59.600][12][info][main] source/server/server.cc:359] starting main dispatch loop
/usr/lib/python3.6/site-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
2018-03-05 18:48:00 kubewatch 0.26.0 INFO: Merging config inputs from /etc/ambassador-config
2018-03-05 18:48:00 kubewatch 0.26.0 INFO: Merging config inputs from /etc/ambassador-config-1
2018-03-05 18:48:00 kubewatch 0.26.0 INFO: Loaded /etc/ambassador-config-1/k8s-dashboard-kubeflow.yaml
2018-03-05 18:48:00 kubewatch 0.26.0 INFO: Event: ADDED default/kubernetes
2018-03-05 18:48:00 kubewatch 0.26.0 INFO: Event: ADDED kubeflow/tf-hub-0
2018-03-05 18:48:00 kubewatch 0.26.0 INFO: Event: ADDED kubeflow/tf-hub-lb
2018-03-05 18:48:00 kubewatch 0.26.0 INFO: Event: ADDED kubeflow/ambassador
2018-03-05 18:48:00 kubewatch 0.26.0 INFO: Event: ADDED kubeflow/ambassador-admin
2018-03-05 18:48:00 kubewatch 0.26.0 INFO: Event: ADDED kubeflow/k8s-dashboard
2018-03-05 18:48:00 kubewatch 0.26.0 INFO: update: including key k8s-dashboard-kubeflow.yaml
2018-03-05 18:48:00 kubewatch 0.26.0 INFO: Scheduling restart
/usr/lib/python3.6/site-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
2018-03-05 18:48:00 diagd 0.26.0 INFO: PROCESS: k8s-dashboard-kubeflow.yaml.1 => k8s-dashboard-kubeflow.yaml
2018-03-05 18:48:00 diagd 0.26.0 INFO: PROCESS: k8s-dashboard-kubeflow.yaml.1 => service k8s-dashboard, namespace kubeflow
2018-03-05 18:48:00 diagd 0.26.0 INFO: CLUSTER cluster_127_0_0_1_8877: new from --internal--
2018-03-05 18:48:00 diagd 0.26.0 INFO: CLUSTER cluster_127_0_0_1_8877: referenced by --internal--
2018-03-05 18:48:00 diagd 0.26.0 INFO: CLUSTER cluster_127_0_0_1_8877: referenced by --internal--
2018-03-05 18:48:00 diagd 0.26.0 INFO: CLUSTER cluster_kubernetes_dashboard_kube_system_otls: new from k8s-dashboard-kubeflow.yaml.1
2018-03-05 18:48:05 kubewatch 0.26.0 INFO: Processing 1 changes
2018-03-05 18:48:05 kubewatch 0.26.0 INFO: Wrote k8s-dashboard-kubeflow.yaml to /etc/ambassador-config-2/k8s-dashboard-kubeflow.yaml
2018-03-05 18:48:05 kubewatch 0.26.0 INFO: generating config with gencount 2
2018-03-05 18:48:05 kubewatch 0.26.0 INFO: PROCESS: k8s-dashboard-kubeflow.yaml.1 => k8s-dashboard-kubeflow.yaml
2018-03-05 18:48:05 kubewatch 0.26.0 INFO: PROCESS: k8s-dashboard-kubeflow.yaml.1 => service k8s-dashboard, namespace kubeflow
2018-03-05 18:48:05 kubewatch 0.26.0 INFO: CLUSTER cluster_127_0_0_1_8877: new from --internal--
2018-03-05 18:48:05 kubewatch 0.26.0 INFO: CLUSTER cluster_127_0_0_1_8877: referenced by --internal--
2018-03-05 18:48:05 kubewatch 0.26.0 INFO: CLUSTER cluster_127_0_0_1_8877: referenced by --internal--
2018-03-05 18:48:05 kubewatch 0.26.0 INFO: CLUSTER cluster_kubernetes_dashboard_kube_system_otls: new from k8s-dashboard-kubeflow.yaml.1
2018-03-05 18:48:05 diagd 0.26.0 WARNING: Scout: could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fc173d6be48>: Failed to establish a new connection: [Errno -3] Try again',))
2018-03-05 18:48:05 diagd 0.26.0 INFO: Scout reports {"latest_version": "0.26.0", "exception": "could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fc173d6be48>: Failed to establish a new connection: [Errno -3] Try again',))", "cached": false, "timestamp": 1520275680.312504}
2018-03-05 18:48:10 kubewatch 0.26.0 WARNING: Scout: could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7ff1e95810f0>: Failed to establish a new connection: [Errno -3] Try again',))
2018-03-05 18:48:10 kubewatch 0.26.0 INFO: Scout reports {"latest_version": "0.26.0", "exception": "could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7ff1e95810f0>: Failed to establish a new connection: [Errno -3] Try again',))", "cached": false, "timestamp": 1520275685.161717}
[2018-03-05 18:48:10.183][24][info][upstream] source/common/upstream/cluster_manager_impl.cc:132] cm init: all clusters initialized
[2018-03-05 18:48:10.183][24][info][config] source/server/configuration_impl.cc:55] loading 1 listener(s)
[2018-03-05 18:48:10.189][24][info][config] source/server/configuration_impl.cc:95] loading tracing configuration
[2018-03-05 18:48:10.189][24][info][config] source/server/configuration_impl.cc:122] loading stats sink configuration
2018-03-05 18:48:10 kubewatch 0.26.0 INFO: Configuration /etc/ambassador-config-2-envoy.json valid
2018-03-05 18:48:10 kubewatch 0.26.0 INFO: Moved valid configuration /etc/ambassador-config-2-envoy.json to /etc/envoy-2.json
unable to initialize hot restart: previous envoy process is still initializing
starting hot-restarter with target: /application/start-envoy.sh
forking and execing new child process at epoch 0
forked new child process with PID=12
got SIGHUP
forking and execing new child process at epoch 1
forked new child process with PID=25
got SIGCHLD
PID=25 exited with code=1
Due to abnormal exit, force killing all child processes and exiting
force killing PID=12
exiting due to lack of child processes
AMBASSADOR: envoy exited with status 1
Here's the envoy.json we were trying to run with:
{
  "listeners": [

    {
      "address": "tcp://0.0.0.0:80",
      "filters": [
        {
          "type": "read",
          "name": "http_connection_manager",
          "config": {
            "codec_type": "auto",
            "stat_prefix": "ingress_http",
            "access_log": [
              {
                "format": "ACCESS [%START_TIME%] \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%\" %RESPONSE_CODE% %RESPONSE_FLAGS% %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% \"%REQ(X-FORWARDED-FOR)%\" \"%REQ(USER-AGENT)%\" \"%REQ(X-REQUEST-ID)%\" \"%REQ(:AUTHORITY)%\" \"%UPSTREAM_HOST%\"\n",
                "path": "/dev/fd/1"
              }
            ],
            "route_config": {
              "virtual_hosts": [
                {
                  "name": "backend",
                  "domains": ["*"],"routes": [

                    {
                      "timeout_ms": 3000,"prefix": "/ambassador/v0/check_ready","prefix_rewrite": "/ambassador/v0/check_ready",
                      "weighted_clusters": {
                          "clusters": [

                                 { "name": "cluster_127_0_0_1_8877", "weight": 100.0 }

                          ]
                      }

                    }
                    ,

                    {
                      "timeout_ms": 3000,"prefix": "/ambassador/v0/check_alive","prefix_rewrite": "/ambassador/v0/check_alive",
                      "weighted_clusters": {
                          "clusters": [

                                 { "name": "cluster_127_0_0_1_8877", "weight": 100.0 }

                          ]
                      }

                    }
                    ,

                    {
                      "timeout_ms": 3000,"prefix": "/ambassador/v0/","prefix_rewrite": "/ambassador/v0/",
                      "weighted_clusters": {
                          "clusters": [

                                 { "name": "cluster_127_0_0_1_8877", "weight": 100.0 }

                          ]
                      }

                    }
                    ,

                    {
                      "timeout_ms": 3000,"prefix": "/k8s/ui/","prefix_rewrite": "/",
                      "weighted_clusters": {
                          "clusters": [

                                 { "name": "cluster_kubernetes_dashboard_kube_system_otls", "weight": 100.0 }

                          ]
                      }

                    }


                  ]
                }
              ]
            },
           "filters": [
              {
                "name": "cors",
                "config": {}
              },{"type": "decoder",
                "name": "router",
                "config": {}
              }
            ]
          }
        }
      ]
    }
  ],
  "admin": {
    "address": "tcp://127.0.0.1:8001",
    "access_log_path": "/tmp/admin_access_log"
  },
  "cluster_manager": {
    "clusters": [
      {
        "name": "cluster_127_0_0_1_8877",
        "connect_timeout_ms": 3000,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [
          {
            "url": "tcp://127.0.0.1:8877"
          }

        ]},
    {
        "name": "cluster_kubernetes_dashboard_kube_system_otls",
        "connect_timeout_ms": 3000,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [
          {
            "url": "tcp://kubernetes-dashboard.kube-system:443"
          }

        ],
        "ssl_context": {

        }}

    ]
  },
  "statsd_udp_ip_address": "127.0.0.1:8125",
  "stats_flush_interval_ms": 1000
}
AMBASSADOR: shutting down

Most upvoted comments

@gxfun @pineking @jiaanguo We hit the same issue; make sure that your DNS is working properly. In my case, the pods could access the internet but could not resolve the domain ‘kubernaut.io’. After configuring upstream nameservers (such as 8.8.8.8) in the cluster DNS, everything worked fine. We use CoreDNS in place of kube-dns; you can find how to configure upstream nameservers here:
kube-dns: https://kubernetes.io/blog/2017/04/configuring-private-dns-zones-upstream-nameservers-kubernetes/
CoreDNS: https://coredns.io/plugins/kubernetes/
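
If you are still on kube-dns rather than CoreDNS, the linked blog post describes an upstreamNameservers setting; a minimal sketch of that ConfigMap (assuming the stock kube-dns deployment in kube-system) looks roughly like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  # Forward anything kube-dns cannot resolve in-cluster to these public resolvers.
  upstreamNameservers: |
    ["8.8.8.8", "8.8.4.4"]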

Thanks @cquptEthan for the solution.

For those who have just gotten started with Kubernetes in general, here’s what you need to do:

kubectl edit configmap coredns -n kube-system

NOTE: the changes are permanent and will survive a reboot of the node.

You’ll need to add 8.8.8.8 in two places: upstream and proxy.

...
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           upstream 8.8.8.8 8.8.4.4
           fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        proxy . /etc/resolv.conf 8.8.8.8 8.8.4.4
        cache 30
        reload
    }

...

You might need to manually delete the coredns pods to get the changes to be picked up.
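
For example (assuming the default k8s-app=kube-dns label that CoreDNS pods usually carry; check the labels in your installation), something like:

# restart the CoreDNS pods so they pick up the edited Corefile
kubectl -n kube-system delete pod -l k8s-app=kube-dns

# then verify that cluster DNS can resolve the external name the Scout reporter needs
kubectl run -it --rm dns-test --image=busybox --restart=Never -- nslookup kubernaut.io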