kubernetes: GCE Cloud: Service (LoadBalancer) networking breaks when pod is restarted

Kubernetes version (use kubectl version):

Server Version: version.Info{
  Major:"1", Minor:"4", 
  GitVersion:"v1.4.7", 
  GitCommit:"92b4f971662de9d8770f8dcd2ee01ec226a6f6c0", 
  GitTreeState:"clean", 
  BuildDate:"2016-12-10T04:43:42Z", 
  GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"
}

Environment:

  • Cloud provider or hardware configuration: Google Cloud
  • OS (e.g. from /etc/os-release):
BUILD_ID=8820.0.0
NAME="Container-VM Image"
GOOGLE_CRASH_ID=Lakitu
VERSION_ID=55
BUG_REPORT_URL=https://crbug.com/new
PRETTY_NAME="Google Container-VM Image"
VERSION=55
GOOGLE_METRICS_PRODUCT_ID=26
HOME_URL="https://cloud.google.com/compute/docs/containers/vm-image/"
ID=gci
  • Kernel (e.g. uname -a):
Linux gke-bernie-cluster-default-pool-b7500cdf-3kr1 4.4.14+ #1 SMP Tue Sep 20 10:32:07 PDT 2016 x86_64 Intel(R) Xeo
n(R) CPU @ 2.50GHz GenuineIntel GNU/Linux
  • Install tools: none
  • Others: none

What happened: We have a replication controller that backs our external REST API; its pods are exposed via a Service. When Kubernetes restarts the container, the Service fails to route traffic to the restarted container. If all pods belonging to the rc are restarted, the API becomes externally unreachable.

What you expected to happen: The Service should automatically resume routing traffic to a restarted pod, so the restart is seamless to external clients.

How to reproduce it (as minimally and precisely as possible): Create a replication controller with one pod that serves an external REST API, and expose the pods via a Service. Force Kubernetes to restart the pod. The API is no longer reachable externally, even though the logs show the pod functioning normally; networking between the Service and the pod is broken.
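As a sketch, the reproduction steps above could be driven with kubectl roughly like this (the manifest filename `rest-api.yaml` and the label `app=rest-api` are assumptions, not taken from the original report):

```shell
# Hypothetical repro sketch; manifest name and labels are assumed.
kubectl create -f rest-api.yaml        # rc with one pod plus a LoadBalancer Service
kubectl get svc rest-api               # note the EXTERNAL-IP once GCE provisions it
curl http://<EXTERNAL-IP>/             # API responds normally at first

# Force Kubernetes to restart the pod (e.g. delete it; the rc recreates it).
kubectl delete pod -l app=rest-api
kubectl get pods -l app=rest-api       # wait for the new pod to reach Running
curl http://<EXTERNAL-IP>/             # observed: request fails, even though
                                       # the pod logs show it serving normally
```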

Anything else we need to know: Here’s the configuration for the Service that selects the replication controller’s pods:

[Screenshots (2017-01-05): Service and replication controller configuration]
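The configuration screenshots are not reproducible here. As a hypothetical sketch only (names, labels, ports, and image are assumptions, not values from the original screenshots), a LoadBalancer Service fronting a one-pod replication controller on GKE would look roughly like:

```yaml
# Hypothetical sketch: names, selector labels, ports, and image are assumed.
apiVersion: v1
kind: Service
metadata:
  name: rest-api
spec:
  type: LoadBalancer     # provisions a GCE network load balancer
  selector:
    app: rest-api        # must match the pod template labels below
  ports:
    - port: 80           # external port on the load balancer
      targetPort: 8080   # container port serving the REST API
---
apiVersion: v1
kind: ReplicationController
metadata:
  name: rest-api
spec:
  replicas: 1
  selector:
    app: rest-api
  template:
    metadata:
      labels:
        app: rest-api    # the Service routes to pods carrying this label
    spec:
      containers:
        - name: rest-api
          image: example/rest-api:latest   # placeholder image
          ports:
            - containerPort: 8080
```

The key linkage is the Service `selector` matching the pod labels: after a pod restart, the endpoints controller should re-add the new pod's IP to the Service's endpoints, which is exactly the step that appears to fail in this report.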

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 51 (22 by maintainers)

Most upvoted comments

Okay, we just experienced a deployment-wide pod failure, and I can demonstrate the “no logging” scenario. Below, you can see that all pods across a deployment were suddenly restarted:

[Screenshot (2017-02-17): all pods across the deployment restarted simultaneously]

Every single one of these containers displays the same message:

[Screenshot (2017-02-17): identical log message from every container]

Once I delete and redeploy all the pods, everything functions normally again.