kubernetes: API server gives inconsistent responses for CRD

What happened:

On a single-node machine with Kubernetes v1.18.10 installed, after a system reboot (the node was cordoned and drained before the reboot), querying a CRD returned two different responses:

  • the first response contained retries while the API server was not reachable (we have a retry mechanism in place), and ended with a “404 page not found” response:
14-Feb-2021 19:37:21	[2021-02-15 02:37:21,162] WARNING  - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fefef6bd1d0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /apis/nci.com/v1/namespaces/kube-system/applicationstatuses/nci-base-default-appstatus
14-Feb-2021 19:37:21	[2021-02-15 02:37:21,162] WARNING  - Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fefef6bdac8>: Failed to establish a new connection: [Errno 111] Connection refused',)': /apis/nci.com/v1/namespaces/kube-system/applicationstatuses/nci-base-default-appstatus
14-Feb-2021 19:37:21	[2021-02-15 02:37:21,163] WARNING  - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fefef6bd908>: Failed to establish a new connection: [Errno 111] Connection refused',)': /apis/nci.com/v1/namespaces/kube-system/applicationstatuses/nci-base-default-appstatus
14-Feb-2021 19:37:31	[2021-02-15 02:37:31,176] WARNING  - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fefef6bdcc0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /apis/nci.com/v1/namespaces/kube-system/applicationstatuses/nci-base-default-appstatus
14-Feb-2021 19:37:31	[2021-02-15 02:37:31,178] WARNING  - Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fefef6bdb00>: Failed to establish a new connection: [Errno 111] Connection refused',)': /apis/nci.com/v1/namespaces/kube-system/applicationstatuses/nci-base-default-appstatus
14-Feb-2021 19:37:31	[2021-02-15 02:37:31,179] WARNING  - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fefef6bd2e8>: Failed to establish a new connection: [Errno 111] Connection refused',)': /apis/nci.com/v1/namespaces/kube-system/applicationstatuses/nci-base-default-appstatus
14-Feb-2021 19:37:44	[2021-02-15 02:37:44,890] DEBUG    - response body: 404 page not found
  • the second response returned the correct information about the custom resources:
{
   "apiVersion":"nci.com/v1",
   "items":[
      {
         "apiVersion":"nci.com/v1",
         "kind":"ApplicationStatus",
         "metadata":{
            "creationTimestamp":"2021-02-15T00:34:19Z",
            "generation":1,
            "managedFields":[
               {
                  "apiVersion":"nci.com/v1",
                  "fieldsType":"FieldsV1",
                  "fieldsV1":{
                     "f:spec":{...}
                  },
                  "manager":"Go-http-client",
                  "operation":"Update",
                  "time":"2021-02-15T00:34:19Z"
               }
            ],
            "name":"nci-base-default-appstatus",
            "namespace":"kube-system",
            "resourceVersion":"1484",
            "selfLink":"/apis/nci.com/v1/namespaces/kube-system/applicationstatuses/nci-base-default-appstatus",
            "uid":"37e45712-5b45-403b-b21a-f4055b50145b"
         },
         "spec":{...}
      },
      ...
   ],
   "kind":"ApplicationStatusList",
   "metadata":{
      "continue":"",
      "resourceVersion":"16464",
      "selfLink":"/apis/nci.com/v1/applicationstatuses"
   }
}

The first query was a REST call for a specific resource of the CRD, /apis/nci.com/v1/namespaces/kube-system/applicationstatuses/nci-base-default-appstatus, while the second listed all resources of the CRD, /apis/nci.com/v1/applicationstatuses.
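
For context, here is a minimal sketch of how these two queries map onto the Kubernetes Python client; this is not our actual retry wrapper, just the equivalent CustomObjectsApi calls for our group/version/plural:

from kubernetes import client, config

# Minimal sketch, not our production wrapper: issue the same two GETs against
# the custom API group defined by the CRD below.
config.load_kube_config()
api = client.CustomObjectsApi()

# First query: a single namespaced custom resource
# GET /apis/nci.com/v1/namespaces/kube-system/applicationstatuses/nci-base-default-appstatus
status = api.get_namespaced_custom_object(
    group="nci.com",
    version="v1",
    namespace="kube-system",
    plural="applicationstatuses",
    name="nci-base-default-appstatus",
)

# Second query: all custom resources of this kind across the cluster
# GET /apis/nci.com/v1/applicationstatuses
all_statuses = api.list_cluster_custom_object(
    group="nci.com",
    version="v1",
    plural="applicationstatuses",
)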

This is our CRD definition:

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: applicationstatuses.nci.com
spec:
  additionalPrinterColumns:
  - JSONPath: .spec.appName
    name: AppName
    type: string
  - JSONPath: .spec.appVersion
    name: AppVersion
    type: string
  - JSONPath: .spec.appInstance
    name: AppInstance
    type: string
  - JSONPath: .spec.appStatus
    name: AppStatus
    type: string
  group: nci.com
  versions:
    - name: v1
      served: true
      storage: true
  scope: Namespaced
  names:
    plural: applicationstatuses
    singular: applicationstatus
    kind: ApplicationStatus
    shortNames:
      - appstatus
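
As a possible diagnostic for the next occurrence, one could check whether the API server is still advertising the nci.com group and whether the CRD is reported as Established at the moment of the 404. A sketch with the Python client (we did not capture this during the failing run; it only uses standard discovery and CRD status fields):

from kubernetes import client, config

# Hypothetical diagnostic, not captured during the failing run: confirm the API
# server still serves the nci.com group and that the CRD reports Established.
config.load_kube_config()

groups = client.ApisApi().get_api_versions()
print("nci.com group served:", any(g.name == "nci.com" for g in groups.groups))

crd = client.ApiextensionsV1beta1Api().read_custom_resource_definition(
    "applicationstatuses.nci.com"
)
established = next(
    (c.status for c in (crd.status.conditions or []) if c.type == "Established"),
    None,
)
print("Established condition:", established)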

What you expected to happen:

The API server should not return a “404 not found” response for a resource that exists. Frankly, this worries me, because inconsistencies in API server responses could cause many things to go wrong.

How to reproduce it (as minimally and precisely as possible):

I was not able to reproduce this manually (although I tried killing the control-plane pods multiple times and rebooting the system while monitoring the API server); so far it has only happened in one of our CI/CD pipelines.

Anything else we need to know?:

For the record, we are using the Kubernetes Python client 17.14.0a1 to talk to the API server, and the CRD was not deleted in any way. I have looked at the Kubernetes Python client code and it does not alter the response in any way; it just makes plain REST API calls to the Kubernetes API server.
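
To illustrate that point, here is a stripped-down equivalent of the failing call without the client library (the in-cluster API server address and service-account paths below are assumptions about the environment, not taken from our pipeline):

import requests

# Assumed in-cluster defaults; the pipeline's real endpoint and credentials may differ.
APISERVER = "https://kubernetes.default.svc"
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"

with open(TOKEN_PATH) as f:
    token = f.read().strip()

# The same GET that the Python client performs; in the failing run the response
# was HTTP 404 with the body "404 page not found".
resp = requests.get(
    APISERVER + "/apis/nci.com/v1/namespaces/kube-system"
    "/applicationstatuses/nci-base-default-appstatus",
    headers={"Authorization": "Bearer " + token},
    verify=CA_PATH,
)
print(resp.status_code, resp.text)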

Environment:

  • Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.10", GitCommit:"62876fc6d93e891aa7fbe19771e6a6c03773b0f7", GitTreeState:"clean", BuildDate:"2020-10-15T01:52:24Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.10", GitCommit:"62876fc6d93e891aa7fbe19771e6a6c03773b0f7", GitTreeState:"clean", BuildDate:"2020-10-15T01:43:56Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: Openstack
  • OS (e.g: cat /etc/os-release): Red Hat Enterprise Linux Server 7.9 (Maipo)
  • Kernel (e.g. uname -a): Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Thu Jan 21 16:15:07 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kubeadm
  • Network plugin and version (if this is a network-related bug): flannel v0.12.0-34-g8936e90-amd64
  • Others: k8s python client 17.14.0a1

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 5
  • Comments: 16 (4 by maintainers)

Most upvoted comments

As a side note, we experience something similar in our multi-node cluster deployments with Rook for persistent storage. Our debugging led us to this thread, where the Rook community discusses a similar inconsistency in API server responses: https://github.com/rook/rook/issues/4274#issuecomment-554552377