kubernetes: AWS: Master stops responding on t2.micros
I’m trying to break #13997 into its component issues.
One issue I was able to reproduce: when launching on t2.micros, after a little less than 24 hours the master stopped responding to kubectl get nodes. I was also unable to SSH into the instance. When I rebooted the master instance through the AWS API it did come back online (the instance took 30 minutes to reboot, but I don't think that's a k8s issue).
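For reference, a minimal sketch of that kind of reboot call using the aws-sdk-go v1 EC2 client; the SDK choice, region, and instance ID here are placeholders for illustration, not what was actually run:

```go
// Sketch: reboot the master through the AWS API with aws-sdk-go v1.
// Region and instance ID are placeholders.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-west-2")))
	svc := ec2.New(sess)

	_, err := svc.RebootInstances(&ec2.RebootInstancesInput{
		InstanceIds: []*string{aws.String("i-0123456789abcdef0")},
	})
	if err != nil {
		log.Fatalf("reboot request failed: %v", err)
	}
	log.Println("reboot requested for master instance")
}
```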
The client was giving a TLS handshake timeout error. I was able to capture the logs after reboot, and noticed these messages shortly before the node went offline:
I1221 14:43:42.656637 10 trace.go:57] Trace "List *api.ServiceList" (started 2015-12-21 14:43:41.200342739 +0000 UTC):
[16.972µs] [16.972µs] About to list etcd node
[858.160862ms] [858.14389ms] Etcd node listed
[1.250900537s] [392.739675ms] Node list decoded
[1.452814222s] [201.913685ms] END
I1221 14:43:42.662117 10 trace.go:57] Trace "List *api.ServiceList" (started 2015-12-21 14:43:41.186539275 +0000 UTC):
[13.794002ms] [13.794002ms] About to list directory
[1.475552892s] [1.46175889s] List extracted
[1.475554865s] [1.973µs] END
I1221 14:43:42.729089 10 trace.go:57] Trace "List *api.EndpointsList" (started 2015-12-21 14:43:41.207543963 +0000 UTC):
[7.815µs] [7.815µs] About to list etcd node
[858.39424ms] [858.386425ms] Etcd node listed
[1.252084981s] [393.690741ms] Node list decoded
[1.417321937s] [165.236956ms] END
I1221 14:43:42.734556 10 trace.go:57] Trace "List *api.EndpointsList" (started 2015-12-21 14:43:41.207537621 +0000 UTC):
[2.661µs] [2.661µs] About to list directory
[1.526977443s] [1.526974782s] List extracted
[1.526980882s] [3.439µs] END
I1221 14:43:48.786497 10 handlers.go:131] GET /api/v1/services: (7.657382831s) 200 [[kube-proxy/v1.1.3 (linux/amd64) kubernetes/6a81b50] 172.20.0.126:51647]
I1221 14:43:51.725595 10 trace.go:57] Trace "List *api.PersistentVolumeList" (started 2015-12-21 14:43:45.667240613 +0000 UTC):
[37.799µs] [37.799µs] About to list etcd node
[4.059455957s] [4.059418158s] Etcd node listed
[4.299796567s] [240.34061ms] Node list decoded
[5.298598648s] [998.802081ms] END
I1221 14:43:51.906564 10 trace.go:57] Trace "List *api.PersistentVolumeList" (started 2015-12-21 14:43:45.996141787 +0000 UTC):
[25.837µs] [25.837µs] About to list etcd node
[3.929678631s] [3.929652794s] Etcd node listed
[4.220383064s] [290.704433ms] Node list decoded
[4.944204619s] [723.821555ms] END
I1221 14:43:52.010993 10 trace.go:57] Trace "List *api.PersistentVolumeList" (started 2015-12-21 14:43:45.986129977 +0000 UTC):
[10.007693ms] [10.007693ms] About to list directory
[6.022202404s] [6.012194711s] List extracted
[6.022205235s] [2.831µs] END
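Aside on reading those Trace lines: the first bracket is elapsed time since the trace started, the second is time spent in that step (e.g. 858ms of a 1.45s ServiceList went to the etcd read). Below is a rough, hypothetical Go sketch of that kind of step-timing helper, not the actual k8s trace package:

```go
// Hypothetical step-timing trace, similar in spirit to the "Trace ..." lines
// above: first column is elapsed since the trace started, second column is
// elapsed since the previous step. Not the real k8s.io/kubernetes package.
package main

import (
	"fmt"
	"time"
)

type traceStep struct {
	msg string
	at  time.Time
}

type trace struct {
	name  string
	start time.Time
	steps []traceStep
}

func newTrace(name string) *trace {
	return &trace{name: name, start: time.Now()}
}

func (t *trace) Step(msg string) {
	t.steps = append(t.steps, traceStep{msg: msg, at: time.Now()})
}

func (t *trace) Log() {
	fmt.Printf("Trace %q (started %s):\n", t.name, t.start.UTC())
	prev := t.start
	for _, s := range t.steps {
		fmt.Printf("[%v] [%v] %s\n", s.at.Sub(t.start), s.at.Sub(prev), s.msg)
		prev = s.at
	}
}

func main() {
	tr := newTrace("List *api.ServiceList")
	tr.Step("About to list etcd node")
	time.Sleep(120 * time.Millisecond) // stand-in for the etcd round trip
	tr.Step("Etcd node listed")
	time.Sleep(40 * time.Millisecond) // stand-in for decoding the response
	tr.Step("Node list decoded")
	tr.Step("END")
	tr.Log()
}
```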
A minute or so after those traces, I got the first of these errors:
I1221 14:44:14.656358 10 logs.go:40] http: TLS handshake error from 172.20.0.127:54092: EOF
Both of those messages started repeating, but then the service appeared to recover for a few minutes, before getting worse again.
Then finally I started seeing this sort of error:
I1221 15:01:14.035065 10 logs.go:40] http: TLS handshake error from 172.20.0.126:51773: EOF
I1221 15:02:08.519787 10 handlers.go:131] GET /api/v1/watch/nodes?fieldSelector=metadata.name%3Dip-172-20-0-126.us-west-2.compute.internal&resourceVersion=32843: (34m20.111665496s) 0 [[kubelet/v1.1.3 (linux/amd64) kubernetes/6a81b50] 172.20.0.126:50117]
I1221 15:02:08.749052 10 handlers.go:131] GET /api/v1/persistentvolumes: (3m7.972076083s) 504
goroutine 101628 [running]:
k8s.io/kubernetes/pkg/httplog.(*respLogger).WriteHeader(0xc20a0ea480, 0x1f8)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/httplog/log.go:188 +0x9a
k8s.io/kubernetes/pkg/apiserver.(*baseTimeoutWriter).timeout(0xc20b03eca0, 0x0, 0x0)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/handlers.go:244 +0xbf
k8s.io/kubernetes/pkg/apiserver.(*timeoutHandler).ServeHTTP(0xc208304060, 0x7fcfc555ea60, 0xc20a0ea480, 0xc20b247ba0)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/handlers.go:168 +0x355
k8s.io/kubernetes/pkg/apiserver.func·013(0x7fcfc555ea28, 0xc209402640, 0xc20b247ba0)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/apiserver/handlers.go:130 +0x1cb
net/http.HandlerFunc.ServeHTTP(0xc20835f740, 0x7fcfc555ea28, 0xc209402640, 0xc20b247ba0)
/usr/src/go/src/net/http/server.go:1265 +0x41
net/http.serverHandler.ServeHTTP(0xc208adeea0, 0x7fcfc555ea28, 0xc209402640, 0xc20b247ba0)
/usr/src/go/src/net/http/server.go:1703 +0x19a
net/http.(*conn).serve(0xc209402460)
/usr/src/go/src/net/http/server.go:1204 +0xb57
created by net/http.(*Server).Serve
/usr/src/go/src/net/http/server.go:1751 +0x35e
[[kube-controller-manager/v1.1.3 (linux/amd64) kubernetes/6a81b50] 127.0.0.1:47162]
I1221 15:01:56.647077 10 logs.go:40] http: TLS handshake error from 172.20.0.128:40481: EOF
I1221 15:02:20.555576 10 handlers.go:131] GET /api/v1/resourcequotas: (2m31.950471878s) 504
And at this point I think the service was simply not responding.
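For context on those 504s: the goroutine dump above is the apiserver's own timeout wrapper (timeoutHandler in pkg/apiserver/handlers.go) giving up on a request that overran its budget, and 0x1f8 in the WriteHeader frame is 504. The standard library has an analogous mechanism, http.TimeoutHandler, shown below purely to illustrate the pattern; the apiserver variant differs in that it replies 504 and logs the blocked goroutine's stack:

```go
// Illustration of the request-timeout pattern using the standard library's
// http.TimeoutHandler (which replies 503; the apiserver's wrapper in the
// stack trace above replies 504 and dumps the blocked goroutine instead).
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(5 * time.Minute) // stand-in for a multi-minute etcd list
		w.Write([]byte("ok"))
	})

	// Requests exceeding the one-minute budget get an error response
	// instead of hanging the client indefinitely.
	log.Fatal(http.ListenAndServe(":8080",
		http.TimeoutHandler(slow, time.Minute, "request timed out")))
}
```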
Although I suspected that we had run out of CPU credits (t2 instances are burstable), this did not appear to be the case based on the CloudWatch metrics.
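For anyone who wants to check the same thing, here is a sketch of querying the instance's CPUCreditBalance metric with the aws-sdk-go v1 CloudWatch client; the instance ID, region, and the 24-hour window are placeholders:

```go
// Sketch: pull the t2 instance's CPUCreditBalance from CloudWatch to rule
// out credit exhaustion. Assumes aws-sdk-go v1; instance ID, region, and
// window are placeholders.
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-west-2")))
	cw := cloudwatch.New(sess)

	out, err := cw.GetMetricStatistics(&cloudwatch.GetMetricStatisticsInput{
		Namespace:  aws.String("AWS/EC2"),
		MetricName: aws.String("CPUCreditBalance"),
		Dimensions: []*cloudwatch.Dimension{
			{Name: aws.String("InstanceId"), Value: aws.String("i-0123456789abcdef0")},
		},
		StartTime:  aws.Time(time.Now().Add(-24 * time.Hour)),
		EndTime:    aws.Time(time.Now()),
		Period:     aws.Int64(3600), // one datapoint per hour
		Statistics: []*string{aws.String("Minimum")},
	})
	if err != nil {
		log.Fatalf("GetMetricStatistics failed: %v", err)
	}

	for _, dp := range out.Datapoints {
		fmt.Printf("%s  min credit balance: %.1f\n", dp.Timestamp, *dp.Minimum)
	}
}
```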
I did wonder whether this corresponded to etcd compaction, and there was indeed a compaction around that time, but it looks normal:
2015/12/21 14:58:58 etcdserver: start to snapshot (applied: 200020, lastsnap: 190019)
2015/12/21 14:58:58 etcdserver: compacted log at index 200020
2015/12/21 14:58:58 wal: segmented wal file /var/etcd/data/member/wal/0000000000000014-0000000000030d55.wal is created
2015/12/21 14:58:58 etcdserver: saved snapshot at index 200020
2015/12/21 14:59:00 filePurge: successfully removed file /var/etcd/data/member/wal/000000000000000f-0000000000024a00.wal
2015/12/21 14:59:05 filePurge: successfully removed file /var/etcd/data/member/snap/0000000000000002-00000000000249ff.snap
I captured all of /var/log, but I didn’t see anything that leapt out at me as being suspicious.
This happened again on the same cluster. I was unable to SSH in, but this time I just left it alone (did not restart it). Some hours later, it started responding to SSH again, and uptime indicated that it had not rebooted.
Looking at the dmesg output, I can see that kube-apiserver was killed by the OOM killer (dmesg output below). Then, some time later, td-agent was killed by the OOM killer, which I presume is when the machine started responding again.

It might be that 1GB is simply not enough for the master. Or perhaps we should configure some swap space, although I believe we deliberately don't do this, because the minute we start swapping we are likely to go unresponsive anyway.
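As a quick way to pull those events out of the kernel log, here is a small sketch; the log path and match strings are assumptions and vary by distro/kernel:

```go
// Sketch: scan the kernel log for OOM-killer events like the kube-apiserver
// and td-agent kills described above. The path and match strings are
// assumptions; adjust for your distro.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/var/log/kern.log") // some distros use /var/log/messages
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // kernel lines can be long
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, "invoked oom-killer") ||
			strings.Contains(line, "Out of memory: Kill process") {
			fmt.Println(line)
		}
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
}
```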
Also, although kube-apiserver is technically running, the machine is very slow and kube-apiserver is not responsive in practice. The TLS handshakes still time out. There are long delays logged reading from etcd:
Unusually (at least in my experience), top shows a lot of time in the 'wa' (I/O wait) state:
dmesg output relating to kube-apiserver kill: