k3s: Increased failure rate on exec/attach discovered on csi e2e tests

Environmental Info: K3s Version:

  • v1.23.7-rc1+k3s1 (in docker mode)
  • v1.24.1-rc2+k3s1 (in default/containerd mode)

Node(s) CPU architecture, OS, and Version:

Linux k3s-master 5.4.0-113-generic #127-Ubuntu SMP Wed May 18 14:30:56 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

Single server

Describe the bug:

We see flaky tests in the CI of the cinder-csi-plugin of cloud-provider-openstack, all related to the following error:

Jun  2 13:10:56.391: INFO: ExecWithOptions {Command:[/bin/sh -c echo +mhlxaKrCV35dwsvcJbvbp3CFlaSAVZGRbNHMSHOYhvigtOOoprNIwi7vQbcbq58smAeLqT9MVTdwIAnzyOh3Q== | base64 -d | sha256sum] Namespace:multivolume-8810 PodName:pod-71cf7d75-fdd8-48f7-b700-3eb09465429e ContainerName:write-pod Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false Quiet:false}
Jun  2 13:10:56.391: INFO: >>> kubeConfig: /root/.kube/config
Jun  2 13:10:56.391: INFO: ExecWithOptions: Clientset creation
Jun  2 13:10:56.392: INFO: ExecWithOptions: execute(POST https://172.24.5.182:6443/api/v1/namespaces/multivolume-8810/pods/pod-71cf7d75-fdd8-48f7-b700-3eb09465429e/exec?command=%2Fbin%2Fsh&command=-c&command=echo+%2BmhlxaKrCV35dwsvcJbvbp3CFlaSAVZGRbNHMSHOYhvigtOOoprNIwi7vQbcbq58smAeLqT9MVTdwIAnzyOh3Q%3D%3D+%7C+base64+-d+%7C+sha256sum&container=write-pod&container=write-pod&stderr=true&stdout=true)
Jun  2 13:10:56.405: FAIL: "echo +mhlxaKrCV35dwsvcJbvbp3CFlaSAVZGRbNHMSHOYhvigtOOoprNIwi7vQbcbq58smAeLqT9MVTdwIAnzyOh3Q== | base64 -d | sha256sum" should succeed, but failed with error message "error dialing backend: EOF"
stdout:
stderr:                                                                                                                                                                                         
Unexpected error:                                                                                                                                                                               
    <*errors.StatusError | 0xc00257cd20>: {                                                                                                                                                     
        ErrStatus: {                                                                                                                                                                            
            TypeMeta: {Kind: "", APIVersion: ""},                                                                                                                                               
            ListMeta: {                                                                                                                                                                         
                SelfLink: "",                                                                                                                                                                   
                ResourceVersion: "",                                                                                                                                                            
                Continue: "",                                                                                                                                                                   
                RemainingItemCount: nil,                                                                                                                                                        
            },                                                                                                                                                                                  
            Status: "Failure",                                                                                                                                                                  
            Message: "error dialing backend: EOF",                                                                                                                                              
            Reason: "",                                                                                                                                                                         
            Details: nil,                                                                                                                                                                       
            Code: 500,                                                                                                                                                                          
        },                                                                                                                                                                                      
    }                                                                                                                                                                                           
    error dialing backend: EOF                                                                                                                                                                  
occurred 

Steps To Reproduce:

So far I haven’t found a way to reproduce this without running the e2e tests, but I wanted to create this ticket to let you know about the problem. Maybe you are already aware 😃 If I can help with testing, please let me know; with the e2e tests I do have a relatively reliable way to reproduce the problem.

When I find a way to reproduce it properly, I will update this issue.

  • Must run on OpenStack
  • Installed K3s:
mkdir -p /var/lib/rancher/k3s/agent/images/
curl -sSL https://github.com/k3s-io/k3s/releases/download/v1.24.1-rc2+k3s1/k3s-airgap-images-amd64.tar -o /var/lib/rancher/k3s/agent/images/k3s-airgap-images.tar
curl -sSL https://github.com/k3s-io/k3s/releases/download/v1.24.1-rc2+k3s1/k3s -o /usr/local/bin/k3s
curl -sSL https://get.k3s.io -o /var/lib/rancher/k3s/install.sh
chmod u+x /var/lib/rancher/k3s/install.sh /usr/local/bin/k3s
INSTALL_K3S_SKIP_DOWNLOAD=true /var/lib/rancher/k3s/install.sh --disable traefik --disable metrics-server --disable servicelb --disable-cloud-controller --kubelet-arg="cloud-provider=external" --tls-san 172.24.5.182 --token 9b08jz.c0izixklcxymnze7

Deploy cloud-provider-openstack and cinder-csi-plugin.

Run cinder-csi e2e.

mkdir -p /var/log/csi-pod
/tmp/kubernetes/test/bin/e2e.test \
  -storage.testdriver=/root/src/k8s.io/cloud-provider-openstack/tests/e2e/csi/cinder/test-driver.yaml \
  -ginkgo.focus='External\s+Storage\s+\[Driver:\s+cinder.csi.openstack.org\]\s+\[Testpattern:\s+Dynamic\s+PV\s+\(ext4\)\]\s+multiVolume\s+\[Slow\]\s+should\s+access\s+to\s+two\s+volumes\s+with\s+the\s+same\s+volume\s+mode\s+and\s+retain\s+data\s+across\s+pod\s+recreation\s+on\s+the\s+same\s+node' \
  -ginkgo.skip='\[Disruptive\]|\[Testpattern:\s+Dynamic\s+PV\s+\(default\s+fs\)\]\s+provisioning\s+should\s+mount\s+multiple\s+PV\s+pointing\s+to\s+the\s+same\s+storage\s+on\s+the\s+same\s+node|\[Testpattern:\s+Dynamic\s+PV\s+\(default\s+fs\)\]\s+provisioning\s+should\s+provision\s+storage\s+with\s+any\s+volume\s+data\s+source\s+\[Serial\]' \
  -ginkgo.noColor \
  -ginkgo.progress \
  -ginkgo.v \
  -test.timeout=0 \
  -report-dir="/var/log/csi-pod" | tee "/var/log/csi-pod/cinder-csi-e2e.log"
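
Since the failing step is a single exec call returning "error dialing backend: EOF", a smaller probe than the full suite might be a loop that execs into an already running pod many times and counts how often that message appears. This is only a sketch and has not been confirmed to trigger the bug; count_exec_failures, the pod name, the namespace and the attempt count are all hypothetical.

```shell
#!/bin/sh
# Hypothetical probe: run the given command N times and count the runs
# that fail with a message containing "error dialing backend".
count_exec_failures() {
  attempts=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # "$@" is the command under test, e.g. a kubectl exec invocation.
    if ! out=$("$@" 2>&1); then
      case $out in
        *"error dialing backend"*) failures=$((failures + 1)) ;;
      esac
    fi
    i=$((i + 1))
  done
  echo "$failures"
}

# Example (pod and namespace are placeholders, not taken from the report):
# count_exec_failures 200 kubectl exec -n default my-pod -- /bin/true
```

Only failures whose output contains the specific message are counted, so unrelated errors (for example a missing pod) do not inflate the number.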

Expected behavior:

No errors. On 1.23.6 we don’t experience any errors.

Actual behavior:

Happens frequently. See https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/openstack-cloud-csi-cinder-e2e-test/1532228188275478528 for an example. All or most of the failed tests are related to the error above.

Additional context / logs:

journalctl -u k3s

Jun 02 13:10:56 k3s-master k3s[131992]: E0602 13:10:56.401545  131992 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"error dialing backend: EOF"}: error dialing backend: EOF
Jun 02 13:10:56 k3s-master k3s[131992]: I0602 13:10:56.400513  131992 log.go:195] http: TLS handshake error from 127.0.0.1:35008: tls: first record does not look like a TLS handshake
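
The two lines above are about a millisecond apart, so it can help to pull just these two message types out of the journal and compare timestamps. A minimal sketch, assuming the message texts stay exactly as quoted; filter_exec_errors is a hypothetical helper name.

```shell
#!/bin/sh
# Keep only journal lines containing one of the two messages quoted above.
# Reads from stdin so it can be fed from journalctl or a saved log file.
filter_exec_errors() {
  grep -E 'error dialing backend|first record does not look like a TLS handshake'
}

# Usage:
# journalctl -u k3s --no-pager | filter_exec_errors
```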

Backporting

  • Needs backporting to older releases

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 27 (23 by maintainers)

Most upvoted comments

@consideRatio curl -sfL https://get.k3s.io | sh -s - --disable traefik --disable servicelb --write-kubeconfig-mode 644 --egress-selector-mode=disabled
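
For setups that prefer a config file over install flags, the same options can probably be written to the K3s config file before running the installer, since K3s reads /etc/rancher/k3s/config.yaml with keys matching the CLI flag names. A sketch under that assumption; write_k3s_config is a hypothetical helper, and this has not been tested against the versions in this issue.

```shell
#!/bin/sh
# Write the flags from the command above as a K3s config file.
# The target path is an optional argument; the default is the
# standard K3s config location.
write_k3s_config() {
  target=${1:-/etc/rancher/k3s/config.yaml}
  mkdir -p "$(dirname "$target")"
  cat > "$target" <<'EOF'
disable:
  - traefik
  - servicelb
write-kubeconfig-mode: "644"
egress-selector-mode: disabled
EOF
}

# Usage (as root), followed by the plain installer:
# write_k3s_config
# curl -sfL https://get.k3s.io | sh -
```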

If someone wants a reproduction of our intermittent issues, you can fork https://github.com/jupyterhub/zero-to-jupyterhub-k8s and run the workflow called “Test chart” in your fork using GitHub Actions. We rely on the GitHub action https://github.com/jupyterhub/action-k3s-helm to set up a k3s environment in which we install a Helm chart and run tests against it.

More specifically, the logs from the test runs provided above are from GitHub Actions runs on this PR branch: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/pull/2798

The k3s setup is done via the action https://github.com/jupyterhub/action-k3s-helm that has a few steps described in the action.yaml file.

I’m writing from mobile atm, got to go!