kubernetes: [Failing Test] (gce-master-scale-performance) Error while dumping cluster logs

Which jobs are failing?

master-informing:

gce-master-scale-performance

Which tests are failing?

kubetest.DumpClusterLogs

Since when has it been failing?

~https://github.com/kubernetes/kubernetes/pull/121242~

Testgrid link

https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-performance

Reason for failure (if possible)

I1018 09:46:14.094967   21599 exec_service.go:123] Exec service: tearing down service
2023/10/18 09:46:29 process.go:155: Step '/home/prow/go/src/k8s.io/perf-tests/run-e2e.sh cluster-loader2 --experimental-gcp-snapshot-prometheus-disk=true --experimental-prometheus-disk-snapshot-name=ci-kubernetes-e2e-gce-scale-performance-1714540134923243520 --experimental-prometheus-snapshot-to-report-dir=true --nodes=5000 --prometheus-scrape-node-exporter --provider=gce --report-dir=/logs/artifacts --testconfig=testing/load/config.yaml --testconfig=testing/huge-service/config.yaml --testconfig=testing/access-tokens/config.yaml --testoverrides=./testing/experiments/enable_restart_count_check.yaml --testoverrides=./testing/experiments/ignore_known_gce_container_restarts.yaml --testoverrides=./testing/overrides/5000_nodes.yaml' finished in 1h57m49.081257202s
2023/10/18 09:46:29 e2e.go:569: Dumping logs from nodes to GCS directly at path: gs://k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1714540134923243520
2023/10/18 09:46:29 process.go:153: Running: /workspace/log-dump.sh /logs/artifacts gs://k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1714540134923243520
Checking for custom logdump instances, if any
Using gce provider, skipping check for LOG_DUMP_SSH_KEY and LOG_DUMP_SSH_USER
Project: k8s-infra-e2e-scale-5k-project
Network Project: k8s-infra-e2e-scale-5k-project
Zone: us-east1-b
Dumping logs temporarily to '/tmp/tmp.Lh8re5VkdY/logs'. Will upload to 'gs://k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1714540134923243520' later.
Dumping logs from master locally to '/tmp/tmp.Lh8re5VkdY/logs'
Trying to find master named 'gce-scale-cluster-master'
Looking for address 'gce-scale-cluster-master-ip'
Looking for address 'gce-scale-cluster-master-internal-ip'
Using master: gce-scale-cluster-master (external IP: 104.196.48.60; internal IP: 10.40.0.2)
Changing logfiles to be world-readable for download
Copying 'kube-apiserver.log kube-apiserver-audit.log kube-scheduler.log cloud-controller-manager.log kube-controller-manager.log etcd.log etcd-events.log glbc.log cluster-autoscaler.log kube-addon-manager.log konnectivity-server.log fluentd.log kubelet.cov cl2-* startupscript.log' from gce-scale-cluster-master
Specify --start=117505 in the next get-serial-port-output invocation to get only the new output starting from here.

client_loop: send disconnect: Broken pipe
/usr/bin/scp: Connection closed
ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [255].
ERROR: (gcloud.compute.scp) Could not fetch resource:
 - The resource 'projects/k8s-infra-e2e-scale-5k-project/zones/us-east1-b/instances/gce-scale-cluster-master' was not found
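
If the `Broken pipe` / `Connection closed` part of this failure is a transient SSH disconnect, wrapping the copy step in a small retry loop is one possible mitigation. The following is only a sketch: `retry` is a hypothetical helper, not something taken from `log-dump.sh`, and it would not help with the second error, where the master instance no longer exists.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not part of log-dump.sh):
# retry a command up to <max_attempts> times, sleeping <delay_seconds>
# between attempts.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    # Run the command; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    if (( attempt >= max_attempts )); then
      echo "command failed after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed, retrying in ${delay}s: $*" >&2
    sleep "${delay}"
    attempt=$(( attempt + 1 ))
  done
}
```

A call could look like `retry 3 10 gcloud compute scp ...`, with the arguments elided here left exactly as the job already passes them.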

Anything else we need to know?

No response

Relevant SIG(s)

/sig testing cc @kubernetes/ci-signal

About this issue

State: closed
Created 8 months ago
Reactions: 1
Comments: 31 (29 by maintainers)

Commits related to this issue

Increase logging around log-dump Even more observability attempts for kubernetes/kubernetes#121320 Strace syscalls touching the target file of the scp, this should allow us to see if any progress is ... — committed to rf232/test-infra by rf232 8 months ago

Most upvoted comments

/milestone v1.29

pacoxu on Nov 6, 2023

CI gce-cos-master-scalability-100 is fixed by https://github.com/kubernetes/test-infra/pull/31197 (3 of the 4 most recent runs passed).

Thanks @aojea for the comment below.

It seems @wojtekt has a clue: the failure is related to the image bump in https://github.com/kubernetes/test-infra/commit/392423fd1037f1b17463e0148087655f48c0d810, since logdump does not seem to get the node names and uses the wrong URL when trying to fetch the node logs.

pacoxu on Nov 7, 2023

A similar issue was observed in another test, https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1718311964112850944, but the logs look a bit different: this one is not failing with a 404 error.

debug1: Sending subsystem: sftp
debug1: pledge: fork
/usr/bin/scp: debug1: Fetching /var/log/etcd.log to /tmp/tmp.gsbMp65qMj/logs/gce-scale-cluster-master

/usr/bin/scp: open local "/tmp/tmp.gsbMp65qMj/logs/gce-scale-cluster-master": No such file or directory
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug1: channel 0: free: client-session, nchannels 1
Transferred: sent 3700, received 6280 bytes, in 0.5 seconds
Bytes per second: sent 7749.7, received 13153.6
debug1: Exit status 0
DEBUG: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
Traceback (most recent call last):
  File "/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 987, in Execute
    resources = calliope_command.Run(cli=self, args=args)
  File "/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 807, in Run
    resources = command_instance.Run(args)
  File "/google-cloud-sdk/lib/surface/compute/scp.py", line 185, in Run
    return scp_helper.RunScp(
  File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/compute/scp_utils.py", line 227, in RunScp
    cmd.Run(
  File "/google-cloud-sdk/lib/googlecloudsdk/command_lib/util/ssh/ssh.py", line 1995, in Run
    raise CommandError(args[0], return_code=status)
googlecloudsdk.command_lib.util.ssh.ssh.CommandError: [/usr/bin/scp] exited with return code [1].
ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
Deleted [https://www.googleapis.com/compute/v1/projects/k8s-infra-e2e-scale-5k-project/global/firewalls/gce-scale-cluster-minion-http-alt].
DEBUG: Running [gcloud.compute.scp] with arguments: [--project: "k8s-infra-e2e-scale-5k-project", --recurse: "True", --scp-flag: "['-v']", --verbosity: "debug", --zone: "us-east1-b", [[USER@]INSTANCE:]DEST: "/tmp/tmp.gsbMp65qMj/logs/gce-scale-cluster-master", [[USER@]INSTANCE:]SRC:1: "['gce-scale-cluster-master:/var/log/etcd-events.log*']"]
DEBUG: Starting new HTTPS connection (1): compute.googleapis.com:443
DEBUG: https://compute.googleapis.com:443 "GET /compute/v1/projects/k8s-infra-e2e-scale-5k-project/zones/us-east1-b/instances/gce-scale-cluster-master?alt=json HTTP/1.1" 200 None
DEBUG: Starting new HTTPS connection (1): compute.googleapis.com:443
DEBUG: https://compute.googleapis.com:443 "GET /compute/v1/projects/k8s-infra-e2e-scale-5k-project?alt=json HTTP/1.1" 200 None
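
The `open local ...: No such file or directory` line above suggests that the local destination did not exist as a directory when scp tried to write into it. One defensive pattern is to create the destination directory right before the copy. This is a minimal sketch under that assumption; `copy_into_dir` is a hypothetical helper, not a function from `log-dump.sh`, and a missing directory may not be the actual root cause here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not part of log-dump.sh):
# make sure the destination directory exists, then run the given copy
# command with the directory appended as its last argument.
# Usage: copy_into_dir <dest_dir> <copy_command> [args...]
copy_into_dir() {
  local dest_dir="$1"
  shift
  # mkdir -p is a no-op if the directory is already there.
  mkdir -p "${dest_dir}" || return 1
  "$@" "${dest_dir}"
}
```

With a local `cp` standing in for the remote copy, `copy_into_dir /tmp/example/logs/master cp ./etcd.log` creates the directory and then copies the file into it; a `gcloud compute scp` call with its source arguments could be passed the same way.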

reetasingh on Oct 29, 2023

@pacoxu the job still uses an old version of the kubekins-e2e image, which carries the log-dump.sh file from test-infra: gcr.io/k8s-staging-test-infra/kubekins-e2e:v20231015-d38ebb23ab-master.

We need to expedite the submission of https://github.com/kubernetes/test-infra/pull/31061 to get more debug logs.
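
For reference, the verbose invocation visible in the debug output quoted in this thread (`--recurse`, `--scp-flag -v`, `--verbosity debug`) can be reproduced by hand. The sketch below only assembles that command and prints it by default; `verbose_scp` and the `DRY_RUN` variable are hypothetical names, and whether the linked PR uses exactly these flags is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not taken from test-infra):
# assemble a verbose `gcloud compute scp` command like the one in the
# debug log. With DRY_RUN=1 (the default) the command is only printed.
# Usage: verbose_scp <project> <zone> <instance:remote_glob> <local_dest>
verbose_scp() {
  local project="$1" zone="$2" src="$3" dest="$4"
  local cmd=(gcloud compute scp
    --project "${project}"
    --zone "${zone}"
    --recurse
    --scp-flag=-v
    --verbosity=debug
    "${src}" "${dest}")
  if [[ "${DRY_RUN:-1}" == "1" ]]; then
    # Print the command on one line instead of executing it.
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

For example, `verbose_scp k8s-infra-e2e-scale-5k-project us-east1-b 'gce-scale-cluster-master:/var/log/etcd-events.log*' /tmp/logs` prints the command; setting `DRY_RUN=0` would actually run it, which requires gcloud credentials for the project.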

tosi3k on Oct 23, 2023