kind: "etcdserver: request timed out"
What happened:
We started using KinD in our prow integration tests for Istio, and occasionally are seeing errors like
Error from server: error when creating "/logs/artifacts/galley-test-a99cf400cb3343eeac7/_suite_context/istio-deployment-577548603/istio-config-only.yaml": etcdserver: request timed out
What you expected to happen: etcd doesn’t time out.
How to reproduce it (as minimally and precisely as possible): This is the hard part, I am not sure how to reproduce this consistently. I do, however, have a bunch of logs from where it occurred - attached below.
I realize this is probably not a very actionable bug report. My main question is: what info do we need to collect to root-cause this?
Environment:
- kind version: (use kind version): v0.3.0
- Kubernetes version: (use kubectl version): I think 1.14? That is what comes with 0.3.0, right?
- Docker version: (use docker info): 18.06.1-ce
- OS (e.g. from /etc/os-release): Ubuntu 16.04
Logs:
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15405/integ-new-install-k8s-presubmit-tests-master/461
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15413/integ-security-k8s-presubmit-tests-master/2777
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15420/integ-security-k8s-presubmit-tests-master/2785
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15420/integ-security-k8s-presubmit-tests-master/2971
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15422/integ-mixer-k8s-presubmit-tests-master/2739
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15422/integ-mixer-k8s-presubmit-tests-master/2764
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15422/integ-security-k8s-presubmit-tests-master/2961
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15424/integ-security-k8s-presubmit-tests-master/2744
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15440/integ-security-k8s-presubmit-tests-master/2950
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15445/integ-mixer-k8s-presubmit-tests-master/2628
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15462/integ-security-k8s-presubmit-tests-master/2787
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15463/integ-security-k8s-presubmit-tests-master/2761
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15464/integ-security-k8s-presubmit-tests-master/2815
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15470/integ-new-install-k8s-presubmit-tests-master/344
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15470/integ-security-k8s-presubmit-tests-master/2778
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15481/integ-mixer-k8s-presubmit-tests-master/2605
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15484/integ-telemetry-k8s-presubmit-tests-master/1745
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15491/integ-security-k8s-presubmit-tests-master/2837
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15495/integ-mixer-k8s-presubmit-tests-master/2632
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15496/integ-security-k8s-presubmit-tests-master/2941
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15503/integ-istioctl-k8s-presubmit-tests-master/1828
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15508/integ-mixer-k8s-presubmit-tests-master/2738
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15547/integ-security-k8s-presubmit-tests-master/2967
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15548/integ-security-k8s-presubmit-tests-master/2965
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15552/integ-pilot-k8s-presubmit-tests-master/2927
About this issue
- State: closed
- Created 5 years ago
- Reactions: 2
- Comments: 21 (21 by maintainers)
I’ve applied the above across the istio test cluster. We should check it in to istio/test-infra so it accurately reflects reality.
Great, thanks for all the help! Now that the logs are being dumped, hopefully we can track down future issues faster too.
@Katharine can you help us get the limit raised? I don’t have access to the cluster
OK, so I dug into it a bit more. The retry PR doesn’t seem to help: either kind comes up, or it fails 3 times. It never seems to fail once and then succeed.
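For reference, the retry behavior described above could be sketched roughly as follows. This is only an illustration, not the code from the retry PR: the `retry` helper, its parameters, and `create_kind_cluster` are hypothetical names, and the only external command assumed is `kind create cluster`.

```python
import subprocess
import time


def retry(action, attempts=3, delay_seconds=0.0, sleep=time.sleep):
    """Call action() up to `attempts` times; return (succeeded, failures)."""
    failures = []
    for attempt in range(1, attempts + 1):
        try:
            action()
            return True, failures
        except Exception as err:
            # Keep every failure so the pattern (fail once vs. fail always) is visible.
            failures.append((attempt, repr(err)))
            if attempt < attempts:
                sleep(delay_seconds)
    return False, failures


def create_kind_cluster():
    # Assumes the `kind` binary is on PATH; raises if the command exits non-zero.
    subprocess.run(["kind", "create", "cluster"], check=True)
```

Calling `retry(create_kind_cluster, attempts=3, delay_seconds=10)` returns the list of failed attempts, which makes it easy to confirm the observation that a run either succeeds immediately or fails every time. If that pattern holds, the cause is more likely a property of the node than a transient race, and retrying on the same node is unlikely to help.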
I ran another test now that the cluster is pretty much empty. The node had no pods running on it until the test was scheduled on it. The first step of the test is setting up the kind cluster: https://gubernator.k8s.io/build/istio-prow/pr-logs/pull/istio_istio/15642/integ-galley-k8s-presubmit-tests-master/3435/
Looking at the node metrics, IO write bytes throttle is peaking at 15 MB/s. So even just the kind create cluster step is causing it to get throttled to some extent. I am not sure how normal that is.
I did the retaining + log dump, hopefully that can help. An example failure is at https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/15637/integ-telemetry-k8s-presubmit-tests-master/2039; the artifacts will have all the logs.
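A minimal sketch of the "retain + log dump" step mentioned above, assuming the `--retain` flag of `kind create cluster` and the `kind export logs` subcommand behave as I understand them (worth checking against `kind --help` for the version in use). The function names, the cluster name, and the output directory are hypothetical.

```python
import subprocess


def kind_create_command(name, retain=True):
    """Build the `kind create cluster` argv; --retain keeps nodes after a failed create."""
    cmd = ["kind", "create", "cluster", "--name", name]
    if retain:
        cmd.append("--retain")
    return cmd


def kind_export_logs_command(name, output_dir):
    """Build the `kind export logs` argv that writes node logs into output_dir."""
    return ["kind", "export", "logs", output_dir, "--name", name]


def create_and_always_dump(name, output_dir, run=subprocess.run):
    """Create the cluster, then export logs whether or not creation succeeded."""
    created = run(kind_create_command(name)).returncode == 0
    run(kind_export_logs_command(name, output_dir))
    return created
```

Exporting logs unconditionally is the point of the design: the failures here are intermittent, so the logs from a failed creation are the ones that matter, and retaining the nodes keeps them around long enough to be collected.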
From these logs the most obvious looking error:
There is also this:
We had some other tests where we got kind running, but then one of our containers running in kind failed to start a server with “no space left on device”. I am pretty sure this is not literal disk space, but the inotify limit? It seems to repeat that “no space left” error about 10 times before it finally exits, which leads me to believe that is the root cause.
So it seems like increasing fs.inotify.max_user_watches may resolve this?
We are getting this one a lot too, and I suspect it is the same problem. It looks like kubeadm is health checking and timing out.
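To check the inotify theory on a node, something like the sketch below could read the current limits before a test starts. On Linux, the kernel reports an exhausted watch limit as ENOSPC, which is rendered as “no space left on device”, so the symptom fits. The `/proc/sys/fs/inotify` files are standard; the function names and any thresholds passed in are hypothetical, not values recommended by kind or Istio.

```python
from pathlib import Path

INOTIFY_DIR = Path("/proc/sys/fs/inotify")
LIMIT_NAMES = ("max_user_watches", "max_user_instances", "max_queued_events")


def read_inotify_limits(base_dir=INOTIFY_DIR):
    """Return {limit name: int value} for each limit file that exists."""
    limits = {}
    for name in LIMIT_NAMES:
        path = Path(base_dir) / name
        if path.is_file():
            limits[name] = int(path.read_text().strip())
    return limits


def below_threshold(limits, minimums):
    """Return the names whose current value is lower than the wanted minimum."""
    return sorted(
        name for name, wanted in minimums.items()
        if limits.get(name, 0) < wanted
    )
```

Logging `read_inotify_limits()` at the start of each job would show whether failing runs land on nodes with low limits. Because these limits are shared by everything on the host, they generally have to be raised on the node itself rather than inside the test container.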