test-infra: Job pull-npd-e2e-test failing: ssh: handshake failed

What happened: After updating the OS image family for job pull-npd-e2e-test to cos-stable (https://github.com/kubernetes/test-infra/pull/29263), the test started failing with the following errors:

Error storing debugging data to test artifacts: [Error running command: {prow $NODE_IP curl http://localhost:20257/metrics   0 error getting SSH client to prow@$NODE_IP:22: 'ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain'}
Error running command: {prow $NODE_IP sudo journalctl -u node-problem-detector.service   0 error getting SSH client to prow@$NODE_IP:22: 'ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain'}
Error running command: {prow $NODE_IP sudo journalctl -k   0 error getting SSH client to prow@$NODE_IP:22: 'ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain'}

Every time the scenario tries to execute a command, it fails to connect to the VM with this SSH handshake failure.

What you expected to happen: The test connects to the VM and successfully executes the command…

How to reproduce it (as minimally and precisely as possible): Trigger job pull-npd-e2e-test in a PR under kubernetes/node-problem-detector.

Please provide links to example occurrences, if any: See all failures starting April 13th: https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-npd-e2e-test

Anything else we need to know?: This issue is blocking new PRs to the NPD repo.

I have seen this issue elsewhere. OpenSSH disabled RSA signatures that use the SHA-1 hash algorithm by default in version 8.8 (see its release notes). COS stable, which is currently COS M105, ships OpenSSH version 9.3:

$ ssh -V
OpenSSH_9.3p1, OpenSSL 1.1.1t  7 Feb 2023

If the test is trying to SSH to the VM using an RSA public key, it must not use SHA-1 as the hashing algorithm (SHA-256 or SHA-512 should work). Another alternative is to switch to an elliptic curve key.
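To illustrate why the RSA key itself can stay and only the signature hash matters, here is a minimal Go sketch using only the standard library. It is not the NPD test code. It signs with RSASSA-PKCS1-v1_5, the scheme behind both the SHA-1 based `ssh-rsa` and the SHA-256 based `rsa-sha2-256` SSH signature algorithms, and checks what a verifier that only accepts SHA-256 would do. The function names `demo`, `signAndVerify`, and `digest`, and the message, are made up for the example.

```go
package main

import (
	"crypto"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha1"
	"crypto/sha256"
	"fmt"
)

// digest hashes msg with SHA-1 or SHA-256 (the only two hashes this sketch handles).
func digest(h crypto.Hash, msg []byte) []byte {
	if h == crypto.SHA1 {
		d := sha1.Sum(msg)
		return d[:]
	}
	d := sha256.Sum256(msg)
	return d[:]
}

// signAndVerify signs msg with signHash, then checks the signature the way a
// verifier that only accepts verifyHash would. It reports whether the
// verifier accepts the signature.
func signAndVerify(key *rsa.PrivateKey, msg []byte, signHash, verifyHash crypto.Hash) bool {
	sig, err := rsa.SignPKCS1v15(rand.Reader, key, signHash, digest(signHash, msg))
	if err != nil {
		return false
	}
	return rsa.VerifyPKCS1v15(&key.PublicKey, verifyHash, digest(verifyHash, msg), sig) == nil
}

// demo uses one RSA key for both attempts; only the signature hash differs.
func demo() (sha1Accepted, sha256Accepted bool, err error) {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return false, false, err
	}
	msg := []byte("example session data")
	sha1Accepted = signAndVerify(key, msg, crypto.SHA1, crypto.SHA256)
	sha256Accepted = signAndVerify(key, msg, crypto.SHA256, crypto.SHA256)
	return sha1Accepted, sha256Accepted, nil
}

func main() {
	sha1OK, sha256OK, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("SHA-1 signature accepted by SHA-256-only verifier:", sha1OK)
	fmt.Println("SHA-256 signature accepted by SHA-256-only verifier:", sha256OK)
}
```

With the same key, the SHA-1 signature is rejected and the SHA-256 signature is accepted, which mirrors a server that no longer accepts `ssh-rsa` but still accepts `rsa-sha2-256`.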

/sig node /sig testing /priority important-soon /cc Random-Liu xmcqueen

About this issue

  • State: closed
  • Created a year ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

A fixed key is used here (for insecure CI only) to avoid running out of space in GCE metadata, and it’s quite possible that this key uses an algorithm that is no longer supported. Someone with access to the prow.k8s.io services cluster (http://go.k8s.io/oncall test-infra, not me) would have to update that particular secret AFAIK, as it doesn’t have a secret manager instance. Alternatively, someone would have to set up secret manager: https://docs.prow.k8s.io/docs/prow-secrets/

Ah, looks like @vteratipally was already upgrading k/k on https://github.com/kubernetes/node-problem-detector/pull/734 but it got closed. Let me send a similar PR.

I took a dive into the NPD E2E lib, and it seems that it executes SSH commands on the node through the golang.org/x/crypto/ssh library. In the vendored version (a very old one, I might add), the default hashing algorithm for RSA keys is indeed SHA-1 (source), which is not supported in cos-stable (cos-105-lts at the time of writing).

In contrast, remote_runner (the library used for most E2E tests) runs the SSH command directly from the OS via os/exec (source), which delegates the choice of hashing algorithm (as well as other nitty-gritty details) to the OS. I believe this is why the other jobs are not failing while this one is.
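As a rough sketch of the second approach (not the actual remote_runner code), shelling out to the system `ssh` binary from Go could look like the following. The helper name `sshArgs`, the key path, the IP address, and the chosen `-o` options are illustrative assumptions. Because the installed OpenSSH client negotiates the signature algorithm, nothing in the Go code selects a hash.

```go
package main

import (
	"fmt"
	"os/exec"
)

// sshArgs builds the argument list for the system ssh client.
// All parameters are placeholders for this example, not values from the job.
func sshArgs(keyPath, user, host, remoteCmd string) []string {
	return []string{
		"-i", keyPath, // private key file to authenticate with
		"-o", "StrictHostKeyChecking=no", // throwaway CI VMs have unknown host keys
		"-o", "UserKnownHostsFile=/dev/null",
		user + "@" + host,
		remoteCmd,
	}
}

func main() {
	args := sshArgs("/path/to/private-key", "prow", "203.0.113.10", "sudo journalctl -k")
	cmd := exec.Command("ssh", args...)
	// Only print the command here; calling cmd.CombinedOutput() would run it.
	fmt.Println(cmd.Args)
}
```

The trade-off is that this depends on an `ssh` binary being present in the test image, whereas the in-process library depends on the vendored Go code being recent enough.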

In addition to rolling back #29330, there are a couple of solutions I can think of:

  1. Upgrade golang.org/x/crypto/ssh in kubernetes/kubernetes dependencies: risky, might break several things and in a way, falls out of scope for this fix.
  2. Change cos-stable to an older version that still supports SHA1: easier solution, but would result in not using the latest image, and eventually we’ll need to address this problem again.
  3. Create new SSH keys (ECDSA, ideally) and upload them to the Prow cluster: best solution, but could cause issues if certain scenarios do not support ECDSA (unlikely).
  4. Create new SSH keys and upload them to the Prow cluster, but not modify the existing preset-k8s-ssh: safest option, but might not be that clean.
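For options 3 and 4, generating an ECDSA key pair could be sketched as follows, using only the Go standard library. This is an illustration under assumptions, not the procedure used for the Prow secret: the helper names (`sshString`, `authorizedKey`, `newKeyPair`) and the comment string are hypothetical, and the public key line is encoded by hand following the RFC 5656 wire format rather than through golang.org/x/crypto/ssh. Uploading the result to the cluster is out of scope here.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"encoding/base64"
	"encoding/binary"
	"encoding/pem"
	"fmt"
)

// sshString encodes b as an SSH wire-format string:
// a 4-byte big-endian length followed by the bytes.
func sshString(b []byte) []byte {
	out := make([]byte, 4, 4+len(b))
	binary.BigEndian.PutUint32(out, uint32(len(b)))
	return append(out, b...)
}

// authorizedKey renders a P-256 public key as an authorized_keys line:
// key type, curve name, then the uncompressed curve point.
func authorizedKey(pub *ecdsa.PublicKey, comment string) string {
	point := make([]byte, 65)
	point[0] = 0x04 // marker for an uncompressed point
	pub.X.FillBytes(point[1:33])
	pub.Y.FillBytes(point[33:65])

	const keyType = "ecdsa-sha2-nistp256"
	blob := sshString([]byte(keyType))
	blob = append(blob, sshString([]byte("nistp256"))...)
	blob = append(blob, sshString(point)...)
	return keyType + " " + base64.StdEncoding.EncodeToString(blob) + " " + comment
}

// newKeyPair generates a P-256 key and returns the private key as a
// PEM "EC PRIVATE KEY" block plus the matching authorized_keys line.
func newKeyPair(comment string) (privatePEM, publicLine string, err error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return "", "", err
	}
	der, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return "", "", err
	}
	block := &pem.Block{Type: "EC PRIVATE KEY", Bytes: der}
	return string(pem.EncodeToMemory(block)), authorizedKey(&key.PublicKey, comment), nil
}

func main() {
	_, pub, err := newKeyPair("ci-test-key")
	if err != nil {
		panic(err)
	}
	fmt.Println(pub) // the private PEM is deliberately not printed
}
```

In practice, `ssh-keygen -t ecdsa` produces an equivalent pair; the sketch only shows that nothing about the key format requires SHA-1.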

From Slack thread:

If using the ssh preset, the creds should be available inside the pod. So inline via env is not needed.

Ref: https://github.com/kubernetes/test-infra/blob/20764650d1ae688f3898fbb176717270209bcd29/config/prow/config.yaml#L830