Kubernetes: API server gives inconsistent responses for a CRD
What happened:
On a machine with Kubernetes v1.18.10 installed (single node), after a system reboot (the node was cordoned and drained before the reboot), querying a CRD returned two different responses:
- the first response contained two rounds of retries against the API server while it was unavailable (we have a retry mechanism in place; an illustrative configuration is sketched after the response excerpts below), ending with a “404 page not found” response:
14-Feb-2021 19:37:21 [2021-02-15 02:37:21,162] WARNING - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fefef6bd1d0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /apis/nci.com/v1/namespaces/kube-system/applicationstatuses/nci-base-default-appstatus
14-Feb-2021 19:37:21 [2021-02-15 02:37:21,162] WARNING - Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fefef6bdac8>: Failed to establish a new connection: [Errno 111] Connection refused',)': /apis/nci.com/v1/namespaces/kube-system/applicationstatuses/nci-base-default-appstatus
14-Feb-2021 19:37:21 [2021-02-15 02:37:21,163] WARNING - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fefef6bd908>: Failed to establish a new connection: [Errno 111] Connection refused',)': /apis/nci.com/v1/namespaces/kube-system/applicationstatuses/nci-base-default-appstatus
14-Feb-2021 19:37:31 [2021-02-15 02:37:31,176] WARNING - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fefef6bdcc0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /apis/nci.com/v1/namespaces/kube-system/applicationstatuses/nci-base-default-appstatus
14-Feb-2021 19:37:31 [2021-02-15 02:37:31,178] WARNING - Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fefef6bdb00>: Failed to establish a new connection: [Errno 111] Connection refused',)': /apis/nci.com/v1/namespaces/kube-system/applicationstatuses/nci-base-default-appstatus
14-Feb-2021 19:37:31 [2021-02-15 02:37:31,179] WARNING - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fefef6bd2e8>: Failed to establish a new connection: [Errno 111] Connection refused',)': /apis/nci.com/v1/namespaces/kube-system/applicationstatuses/nci-base-default-appstatus
14-Feb-2021 19:37:44 [2021-02-15 02:37:44,890] DEBUG - response body: 404 page not found
- the second response returned the correct information about the custom resources:
{
  "apiVersion":"nci.com/v1",
  "items":[
    {
      "apiVersion":"nci.com/v1",
      "kind":"ApplicationStatus",
      "metadata":{
        "creationTimestamp":"2021-02-15T00:34:19Z",
        "generation":1,
        "managedFields":[
          {
            "apiVersion":"nci.com/v1",
            "fieldsType":"FieldsV1",
            "fieldsV1":{
              "f:spec":{...}
            },
            "manager":"Go-http-client",
            "operation":"Update",
            "time":"2021-02-15T00:34:19Z"
          }
        ],
        "name":"nci-base-default-appstatus",
        "namespace":"kube-system",
        "resourceVersion":"1484",
        "selfLink":"/apis/nci.com/v1/namespaces/kube-system/applicationstatuses/nci-base-default-appstatus",
        "uid":"37e45712-5b45-403b-b21a-f4055b50145b"
      },
      "spec":{...}
    },
    ...
  ],
  "kind":"ApplicationStatusList",
  "metadata":{
    "continue":"",
    "resourceVersion":"16464",
    "selfLink":"/apis/nci.com/v1/applicationstatuses"
  }
}
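For context, the warning format in the first response is what urllib3 emits when a Retry policy on the HTTP adapter re-attempts a broken connection. A minimal illustrative configuration that produces warnings of this shape (an assumption about the general form of such a setup, not our exact pipeline code; the API server address and token are hypothetical placeholders) looks like:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Three connection retries produce the total=2, 1, 0 warnings seen above
# before the final failure is surfaced to the caller.
retries = Retry(total=3, connect=None, read=None, redirect=None, status=None)

session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retries))

# Hypothetical API server address, bearer token, and CA path, for illustration only.
resp = session.get(
    "https://127.0.0.1:6443/apis/nci.com/v1/namespaces/kube-system/"
    "applicationstatuses/nci-base-default-appstatus",
    headers={"Authorization": "Bearer <token>"},
    verify="/etc/kubernetes/pki/ca.crt",
)
print(resp.status_code, resp.text)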
The first query was a REST call to a specific resource of the CRD, /apis/nci.com/v1/namespaces/kube-system/applicationstatuses/nci-base-default-appstatus, while the second listed all resources of the CRD via /apis/nci.com/v1/applicationstatuses.
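For reference, the two queries correspond to the following calls in the Kubernetes Python client (a minimal sketch; the kubeconfig loading and variable names are illustrative assumptions, not our pipeline code):

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in a pod
api = client.CustomObjectsApi()

# First query: GET /apis/nci.com/v1/namespaces/kube-system/applicationstatuses/nci-base-default-appstatus
single = api.get_namespaced_custom_object(
    group="nci.com",
    version="v1",
    namespace="kube-system",
    plural="applicationstatuses",
    name="nci-base-default-appstatus",
)

# Second query: GET /apis/nci.com/v1/applicationstatuses (all namespaces)
all_items = api.list_cluster_custom_object(
    group="nci.com",
    version="v1",
    plural="applicationstatuses",
)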
This is our CRD definition:
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: applicationstatuses.nci.com
spec:
  additionalPrinterColumns:
  - JSONPath: .spec.appName
    name: AppName
    type: string
  - JSONPath: .spec.appVersion
    name: AppVersion
    type: string
  - JSONPath: .spec.appInstance
    name: AppInstance
    type: string
  - JSONPath: .spec.appStatus
    name: AppStatus
    type: string
  group: nci.com
  versions:
  - name: v1
    served: true
    storage: true
  scope: Namespaced
  names:
    plural: applicationstatuses
    singular: applicationstatus
    kind: ApplicationStatus
    shortNames:
    - appstatus
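For completeness, the registration state of this CRD can be inspected through the same Python client before the custom resources are queried; a minimal sketch, assuming the apiextensions.k8s.io/v1beta1 API that matches the definition above:

from kubernetes import client, config

config.load_kube_config()
ext_api = client.ApiextensionsV1beta1Api()

crd = ext_api.read_custom_resource_definition("applicationstatuses.nci.com")

# The "Established" condition indicates the API server is serving the CRD's endpoints.
for condition in crd.status.conditions or []:
    if condition.type == "Established":
        print("Established:", condition.status)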
What you expected to happen:
The API server should not return a “404 not found” response. Honestly, this scares me, because inconsistencies in API server responses could lead to many things going wrong.
How to reproduce it (as minimally and precisely as possible):
I was not able to reproduce this manually (although I tried killing the control plane pods multiple times and rebooting the system while monitoring the API server); it has only happened in one of our CI/CD pipelines.
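For illustration, a monitoring loop that could catch the transient state (the single-resource GET failing while the list call succeeds) might look like the sketch below. This is one possible way to watch for the inconsistency during a reboot, not the actual pipeline code:

import time
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
api = client.CustomObjectsApi()

while True:
    get_status = list_status = "ok"
    try:
        api.get_namespaced_custom_object(
            group="nci.com", version="v1", namespace="kube-system",
            plural="applicationstatuses", name="nci-base-default-appstatus")
    except ApiException as exc:
        get_status = exc.status      # a 404 here would match the reported behaviour
    except Exception as exc:
        get_status = repr(exc)       # e.g. connection refused while the API server is down

    try:
        api.list_cluster_custom_object(
            group="nci.com", version="v1", plural="applicationstatuses")
    except Exception as exc:
        list_status = repr(exc)

    if get_status != list_status:
        print(time.strftime("%H:%M:%S"), "inconsistent:", get_status, "vs", list_status)
    time.sleep(1)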
Anything else we need to know?:
Note that we are using the Kubernetes Python client 17.14.0a1 to talk to the API server, and that the CRD was not deleted in any way. I have looked at the Kubernetes Python client code and it does not alter the response in any way; it just makes simple REST API calls to the Kubernetes API server.
Environment:
- Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.10", GitCommit:"62876fc6d93e891aa7fbe19771e6a6c03773b0f7", GitTreeState:"clean", BuildDate:"2020-10-15T01:52:24Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.10", GitCommit:"62876fc6d93e891aa7fbe19771e6a6c03773b0f7", GitTreeState:"clean", BuildDate:"2020-10-15T01:43:56Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: OpenStack
- OS (e.g: cat /etc/os-release): Red Hat Enterprise Linux Server 7.9 (Maipo)
- Kernel (e.g. uname -a): Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Thu Jan 21 16:15:07 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: kubeadm
- Network plugin and version (if this is a network-related bug): flannel v0.12.0-34-g8936e90-amd64
- Others: k8s python client 17.14.0a1
As a side note, we experience something similar in our multi-node cluster deployments with Rook for persistent storage. Our debugging led us to this thread, where the Rook community discusses similar inconsistencies in API server responses: https://github.com/rook/rook/issues/4274#issuecomment-554552377