test-infra: Job pull-npd-e2e-test failing: ssh: handshake failed
What happened:
After updating the OS image family for job pull-npd-e2e-test to cos-stable (https://github.com/kubernetes/test-infra/pull/29263), the test started failing with the following errors:
Error storing debugging data to test artifacts: [Error running command: {prow $NODE_IP curl http://localhost:20257/metrics 0 error getting SSH client to prow@$NODE_IP:22: 'ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain'}
Error running command: {prow $NODE_IP sudo journalctl -u node-problem-detector.service 0 error getting SSH client to prow@$NODE_IP:22: 'ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain'}
Error running command: {prow $NODE_IP sudo journalctl -k 0 error getting SSH client to prow@$NODE_IP:22: 'ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain'}
Every time the scenario tries to execute a command, it fails to connect to the VM with this SSH handshake failure.
What you expected to happen: The test connects to the VM and successfully executes the command…
How to reproduce it (as minimally and precisely as possible):
Trigger job pull-npd-e2e-test in a PR under kubernetes/node-problem-detector.
Please provide links to example occurrences, if any: See all failures starting April 13th: https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-npd-e2e-test
Anything else we need to know?: This issue is blocking new PRs to NPD repo.
This is an issue I saw in another place. OpenSSH deprecated the SHA-1 hash algorithm for RSA signatures, according to the version 8.9 release notes. COS stable, which is currently COS M105, uses OpenSSH version 9.3:
$ ssh -V
OpenSSH_9.3p1, OpenSSL 1.1.1t 7 Feb 2023
If the test is trying to SSH to the VM using an RSA public key, it must not use SHA-1 as the hashing algorithm (SHA-256 or SHA-512 should work). Another alternative is to switch to an elliptic curve key.
/sig node /sig testing /priority important-soon /cc Random-Liu xmcqueen
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 15 (15 by maintainers)
There’s a fixed key used here (for insecure CI only) to avoid running out of space in GCE metadata; it’s quite possible that key is using an algorithm that is no longer supported. Someone with access to the prow.k8s.io services cluster (http://go.k8s.io/oncall test-infra, not me) would have to update that particular secret AFAIK, as it doesn’t have a secret manager instance. Alternatively, someone would have to set up secret manager: https://docs.prow.k8s.io/docs/prow-secrets/
Ah, looks like @vteratipally was already upgrading k/k on https://github.com/kubernetes/node-problem-detector/pull/734 but it got closed. Let me send a similar PR.
I took a dive into the NPD E2E lib, and it seems that when it executes SSH commands on the node, it does so through the golang.org/x/crypto/ssh library. In the vendored version (a very old one, I may say), the default hashing algorithm for RSA keys is indeed SHA-1 (source), which is not supported in cos-stable (cos-105-lts at the time this comment is written).
In contrast, remote_runner (the library used for most E2E tests) runs the SSH command directly from the OS via os/exec (source), which delegates the hashing algorithm decision (as well as other nitty-gritty details) to the OS. I believe this is the reason why the other jobs are not failing while this one is.
In addition to rolling back #29330, there are a couple of solutions I can think of:
- Upgrade golang.org/x/crypto/ssh in kubernetes/kubernetes dependencies: risky, might break several things, and in a way falls out of scope for this fix.
- Roll back cos-stable to an older version that still supports SHA-1: easier solution, but it would mean not using the latest image, and eventually we’ll need to address this problem again.
- Use preset-k8s-ssh: safest option, but might not be that clean.
From Slack thread:
If using the ssh preset, the creds should be available inside the pod. So inline via env is not needed.
Ref: https://github.com/kubernetes/test-infra/blob/20764650d1ae688f3898fbb176717270209bcd29/config/prow/config.yaml#L830
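For reference, the approach of shelling out to the system ssh binary (as opposed to using an in-process SSH client) can be sketched roughly as follows. This is not the remote_runner code; buildSSHCommand, the SSH_PRIVATE_KEY_FILE environment variable, the fallback key path, and the NODE_IP placeholder are all hypothetical names chosen for illustration. The idea is only that the key file mounted in the pod is passed to ssh, and ssh itself negotiates the signature algorithm.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// buildSSHCommand (hypothetical helper) prepares an invocation of the system
// ssh binary. Because the OS client performs the handshake, the choice of
// signature algorithm is left to the installed OpenSSH version.
func buildSSHCommand(user, host, keyPath, command string) *exec.Cmd {
	args := []string{
		"-i", keyPath,
		// Host key checks are relaxed here on the assumption of throwaway CI VMs.
		"-o", "StrictHostKeyChecking=no",
		"-o", "UserKnownHostsFile=/dev/null",
		fmt.Sprintf("%s@%s", user, host),
		command,
	}
	return exec.Command("ssh", args...)
}

func main() {
	// Assumption: the key file path is exposed to the pod through an
	// environment variable; the variable name below is made up.
	keyPath := os.Getenv("SSH_PRIVATE_KEY_FILE")
	if keyPath == "" {
		keyPath = "/path/to/ssh-private-key" // placeholder
	}

	cmd := buildSSHCommand("prow", "NODE_IP", keyPath, "sudo journalctl -k")

	// Only print the command; actually running it needs a reachable VM.
	fmt.Println(strings.Join(cmd.Args, " "))
}
```

Running the command would be a matter of calling cmd.CombinedOutput() and checking the error, which is where a handshake failure would surface.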