test-infra: gke tests are missing log artifacts

Some upgrade tests are missing logs, which makes them practically impossible to debug and fix, e.g. https://github.com/kubernetes/kubernetes/issues/50793#issuecomment-326300909

Apparently we’re using the wrong scp command.

/assign @shyamjvs

slack conversation:

[2:36 PM] 
fejta
yes because a) upgrading will reboot the master and b) we collect logs at the end of the run


[2:37 PM] 
@shyamjvs have you given any thoughts to this? Does your log collector help here?


[2:40 PM] 
One idea would be to make the test that does the upgrade dump logs first, if requested, using a pattern similar to https://github.com/kubernetes/kubernetes/blob/0596891e424b4d9e1858117641cf93c7cf266450/test/e2e/e2e.go#L193


[2:41 PM] 
Alternatively, add a dump-cluster-logs step:
https://github.com/kubernetes/test-infra/blob/master/kubetest/e2e.go#L217

Before running upgrades: 
https://github.com/kubernetes/test-infra/blob/master/kubetest/e2e.go#L164
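A rough sketch of the "dump logs before running upgrades" idea, written in shell rather than in kubetest's Go. `DUMP_CMD` and `UPGRADE_CMD` are hypothetical placeholders for whatever actually dumps logs and runs the upgrade; they are not real kubetest options. The only point is the ordering: one dump before the upgrade, one after, with the upgrade's exit status preserved.

```shell
# Hypothetical wrapper: dump logs, run the upgrade, dump logs again.
# DUMP_CMD and UPGRADE_CMD are assumed to name commands or shell functions.
dump_upgrade_dump() {
  local artifacts_dir="$1"
  local upgrade_status=0

  # A failed dump is reported but does not block the upgrade.
  "${DUMP_CMD}" "${artifacts_dir}/before-upgrade" \
    || echo "pre-upgrade log dump failed" >&2

  # Remember the upgrade result so the final dump still runs on failure.
  "${UPGRADE_CMD}" || upgrade_status=$?

  "${DUMP_CMD}" "${artifacts_dir}/after-upgrade" \
    || echo "post-upgrade log dump failed" >&2

  return "${upgrade_status}"
}
```

Writing the two dumps to separate directories keeps the pre-upgrade logs from being overwritten by the end-of-run collection.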




----- Today September 1st, 2017 -----
[5:50 AM] 
shyamjvs
@fejta @ericchiang The master rebooting during the upgrade is not the reason the logs are failing to be collected.


[5:52 AM] 
The log-dump script has nothing much to do with master reboot... it's collecting logs from nodes by individually SCP'ing to them


[5:54 AM] 
(which works even if the master is down, except for the logs of the master itself)


[6:09 AM] 
shyamjvs
The issue is in this line: https://github.com/kubernetes/kubernetes/blob/master/cluster/log-dump/log-dump.sh#L109


[6:11 AM] 
We're defining the 'use_custom_instances_list' function for GKE, which makes the script use plain 'scp'. But scp expects an IP address, while we're actually passing it the VM name.


[6:13 AM] 
I'll send a PR to use 'gcloud compute scp' instead of 'scp' when a node name is provided instead of an IP. That should fix it.
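For illustration only, the choice described above could look roughly like this. It is a sketch, not the code from log-dump.sh or from the eventual PR: `is_ipv4` and `copy_logs_cmd` are hypothetical helper names, the IPv4 check is a loose pattern match, and the function only prints the command it would run so the selection logic is easy to check.

```shell
# Succeeds if the argument looks like a dotted-quad IPv4 address.
# (Loose check: it does not verify that each octet is <= 255.)
is_ipv4() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Prints (does not run) the copy command for a single node.
#   $1: node IP address or VM name
#   $2: remote path to copy
#   $3: local destination directory
#   $4: GCE zone, only used when $1 is a VM name
copy_logs_cmd() {
  local node="$1" remote_path="$2" dest_dir="$3" zone="$4"
  if is_ipv4 "${node}"; then
    # A plain IP address can be handed to scp directly.
    echo "scp ${node}:${remote_path} ${dest_dir}"
  else
    # A VM name has to go through gcloud, which looks up the instance.
    echo "gcloud compute scp --zone ${zone} ${node}:${remote_path} ${dest_dir}"
  fi
}
```

For example, `copy_logs_cmd gke-node-1 /var/log/kubelet.log /tmp/logs us-central1-b` prints a `gcloud compute scp` command, while passing `10.0.0.5` as the first argument prints a plain `scp` command. The node name, path and zone here are made-up values.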

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 41 (41 by maintainers)

Most upvoted comments

I’ve got a working fix for this that I tested locally - I’ll send out a PR for it.

GitHub needs a "For Science!" emoji thing 😃