ec2-fleet-plugin: New nodes get I/O error and disconnect at ssh timeout

When the ec2-fleet-plugin adds new Spot Fleet instances as Jenkins nodes using the launcher selection “Launch agents via SSH”, the nodes all connect just fine, but some percentage of them disconnect with a SEVERE I/O error shortly after launch. The I/O error happens at exactly the configured “Connection Timeout in Seconds” from launch. When reconnected after that, they have no problems.

I have not confirmed, but I think this only happens when “Max Idle Minutes Before Scaledown” is set (attaching the IdleRetentionStrategy to the node).

Here’s the sequence from the logs with the default ssh connection timeout of 210 seconds:

15:01:11.832 - INFO: Found new instances from fleet (ec2-fleet test): [<snip>, i-08db665d464785aec, <snip>]
15:01:21.966 - INFO: Idle Retention initiated
15:01:21.967 - INFO: Attempting to reconnect i-08db665d464785aec
15:01:56.067 - SSH Launch of i-08db665d464785aec on 10.21.131.211 completed in 34,083 ms
15:04:51.990 - SEVERE: I/O error in channel i-08db665d464785aec
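
Doing the arithmetic on that timeline: the SEVERE error lands almost exactly 210 seconds after the “Attempting to reconnect” entry, i.e. the configured connection timeout. A quick check using only the timestamps from the log above:

import java.time.Duration;
import java.time.LocalTime;

// Sanity check of the log timeline: the I/O error arrives one full
// "Connection Timeout in Seconds" (default 210 s) after the reconnect attempt.
class TimelineCheck {
    public static void main(String[] args) {
        LocalTime reconnect = LocalTime.parse("15:01:21.967"); // Attempting to reconnect
        LocalTime ioError   = LocalTime.parse("15:04:51.990"); // SEVERE: I/O error in channel
        System.out.println(Duration.between(reconnect, ioError)); // PT3M30.023S, i.e. ~210 s
    }
}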

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 49

Most upvoted comments

So, I created the following PR, which seems to fix the issue: https://github.com/jenkinsci/trilead-ssh2/pull/36

I don’t get any disconnections after that; here is the log:

Connection refused (Connection refused)
SSH Connection failed with IOException: "Connection refused (Connection refused)", retrying in 15 seconds.  There are 10 more retries left.
Connection refused (Connection refused)
SSH Connection failed with IOException: "Connection refused (Connection refused)", retrying in 15 seconds.  There are 9 more retries left.
Connection refused (Connection refused)
SSH Connection failed with IOException: "Connection refused (Connection refused)", retrying in 15 seconds.  There are 8 more retries left.
Connection refused (Connection refused)
SSH Connection failed with IOException: "Connection refused (Connection refused)", retrying in 15 seconds.  There are 7 more retries left.
[11/21/18 13:36:29] [SSH] The SSH key with fingerprint ae:22:af:b1:8a:5e:71:6c:0a:48:79:e1:b0:73:54:26 has been automatically trusted for connections to this machine.
[11/21/18 13:36:29] [SSH] Authentication successful.

... [Printing environment details]

Nov 21, 2018 1:36:29 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /var/lib/jenkins/remoting as a remoting work directory
Both error and output logs will be printed to /var/lib/jenkins/remoting
<===[JENKINS REMOTING CAPACITY]===>channel started

... [More environment details]

slave setup done.
Nov 21, 2018 1:36:34 PM org.jenkinsci.remoting.util.AnonymousClassWarnings warn
WARNING: Attempt to (de-)serialize anonymous class org.jenkinsci.plugins.envinject.EnvInjectComputerListener$2; see: https://jenkins.io/redirect/serialization-of-anonymous-classes/
Agent successfully connected and online

... [The following error appears when the plugin scales down the fleet and the node actually dies, a bit over 5 minutes later.]

ERROR: Connection terminated
java.io.EOFException
	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2681)
	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3156)
	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:862)
	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
	at hudson.remoting.Command.readFrom(Command.java:140)
	at hudson.remoting.Command.readFrom(Command.java:126)
	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
Caused: java.io.IOException: Unexpected termination of the channel
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
ERROR: Socket connection to SSH server was lost
java.io.IOException: Cannot read full block, EOF reached.
	at com.trilead.ssh2.crypto.cipher.CipherInputStream.getBlock(CipherInputStream.java:81)
	at com.trilead.ssh2.crypto.cipher.CipherInputStream.read(CipherInputStream.java:108)
	at com.trilead.ssh2.transport.TransportConnection.receiveMessage(TransportConnection.java:232)
	at com.trilead.ssh2.transport.TransportManager.receiveLoop(TransportManager.java:706)
	at com.trilead.ssh2.transport.TransportManager$1.run(TransportManager.java:502)
	at java.lang.Thread.run(Thread.java:748)
Slave JVM has not reported exit code before the socket was lost
[11/21/18 13:42:06] [SSH] Connection closed.

Since it is a problem with the trilead timeoutHandler, I’m going to try to work around this issue by increasing the “Connection Timeout in Seconds” to a massive number and seeing whether that stabilizes my Jenkins environment. The drawback of this approach is that a connection attempt could then hang indefinitely, but that seems preferable to a rogue asynchronous process killing my builds by cutting the ssh connection.
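
To make the failure mode concrete, here is a purely illustrative sketch of a connect-timeout watchdog; this is not trilead-ssh2’s actual code, and the PR linked above may implement the fix differently. If the cancellation step is missing or loses a race, the watchdog tears down a perfectly healthy channel at exactly the configured timeout, and a huge timeout simply pushes that moment far enough out that it no longer matters.

import java.io.IOException;
import java.net.Socket;
import java.util.Timer;
import java.util.TimerTask;

// Illustrative only: a connect-timeout watchdog of the kind suspected here,
// NOT the actual trilead-ssh2 implementation.
class ConnectWatchdogSketch {
    void connectWithTimeout(Socket socket, long timeoutMillis) throws IOException {
        Timer watchdog = new Timer(true);
        TimerTask killer = new TimerTask() {
            @Override public void run() {
                try {
                    socket.close(); // fires at timeoutMillis regardless of connection state
                } catch (IOException ignored) { }
            }
        };
        watchdog.schedule(killer, timeoutMillis);

        // ... TCP connect and SSH handshake happen here ...

        killer.cancel();   // the crucial step: without it (or if it races with the timer),
        watchdog.cancel(); // the watchdog later closes an already-established channel
    }
}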

@LoveDuckie Can you please open a new issue with logs and details about your setup? I’m closing this issue since it is quite old, and having up-to-date information against newer releases would be more helpful.

Indeed, upping the timeout to an insanely large value and adding “docker info &&” to the Prefix command works like a charm. Thanks for the workaround!
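
For context on why that helps: the prefix is simply prepended to the remote command that starts the agent, so with “docker info &&” the agent command only runs if “docker info” succeeds, i.e. the Docker daemon is reachable. A rough sketch of the composition (the names below are made up for illustration, not the plugin’s actual API):

// Hypothetical illustration of how a configured prefix composes into the
// remote start command; the real SSH launcher builds this string internally.
class PrefixCommandSketch {
    public static void main(String[] args) {
        String prefix   = "docker info && "; // succeeds only once the Docker daemon answers
        String startCmd = "java -jar remoting.jar -workDir /var/lib/jenkins/remoting";
        System.out.println(prefix + startCmd); // roughly what gets executed over SSH
    }
}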

🤦‍♂️ Never mind, I get it now: it’s in the cloud configuration section of the plugin configuration.

I found another reference to this issue (see the latest comments by Eugene): https://issues.jenkins-ci.org/browse/JENKINS-48955