net-sftp: Downloading large files (>4GB) fails with hang

Overview

@testflyjets over at SD Ruby came across a puzzling issue when downloading files larger than 4 GB using Net::SFTP#download!. In summary, the download runs (with 64K read_size) up to 4294049792 bytes (≈3.99915 GB), at which point the process hangs. When given an interrupt (ctrl+c) after the hang, the top stack trace entry points to /net-ssh-2.6.7/lib/net/ssh/ruby_compat.rb:30. I know this is within Net::SSH, but I thought this might be a good place to start, since this use case is pretty specific to SFTP.

The Gist linked below contains a script that I’ve been using to troubleshoot. It’s a pretty basic Net::SFTP implementation that downloads a single file. I’m using a large tar file that is 4404802560 bytes (≈4.1 GB).

https://gist.github.com/bradland/5568217

To reproduce, alter the parameters of the script to work for your local environment (you’ll have to create your own large file; see Util section near end), then execute. You’ll see the progress output stop at around 4294049792 bytes, at which point you can issue ctrl+c to see a stack trace.

Investigating the stack trace

The original author was using SFTP#download!, but the stack trace when using SFTP#download is a lot easier to parse. You can see the output in 001-net_sftp_download.txt. Below is a linked list of the stack locations for easy reference:

Execution environment:

uname -a output: Darwin macbookair.local 12.3.0 Darwin Kernel Version 12.3.0: Sun Jan 6 22:37:10 PST 2013; root:xnu-2050.22.13~1/RELEASE_X86_64 x86_64
When running, the Ruby process is listed as “Intel (64 bit)” in activity monitor
sshd is also listed as “Intel (64 bit)”
sshd version: OpenSSH_5.9p1, OpenSSL 0.9.8r 8 Feb 2011
Ruby version: 1.9.3 p392
Gems (clean rvm gemset)
- bundler (1.3.5)
- columnize (0.3.6)
- debugger (1.5.0)
- debugger-linecache (1.2.0)
- debugger-ruby_core_source (1.2.0)
- net-sftp (2.1.2)
- net-ssh (2.6.7)
Tested using localhost for SFTP#start host argument, as well as another MacBook Pro running Mountain Lion on the same LAN
Original author’s execution environment was CentOS 64-bit (no information yet on version)

Introspection (rdebug, gdb)

The hang appears to occur at the IO#select call within Net::SSH, ruby_compat.rb. I don’t know how type safe IO#select is, but given the proximity to 4^32, and the fact that this is the point that we jump in to a system call (from my amateur perspective), it seems a likely place for a buffer overflow. I’m not terribly proficient at this low level. Because ruby_compat.rb is just a compatibility wrapper, we can look at Net::SSH::Connection#process for the setup:

def process(wait=nil, &block)
  return false unless preprocess(&block)

  r = listeners.keys
  w = r.select { |w2| w2.respond_to?(:pending_write?) && w2.pending_write? }
  readers, writers, = Net::SSH::Compat.io_select(r, w, nil, wait)

  postprocess(readers, writers)
end

This is the point where I think it makes sense to start inspecting, but I’m kind of at a loss for how to do that.

I invoked the script with the Ruby debugger gem active, but once the process hangs, the rdb console becomes unresponsive. I fired up gdb -p <process_id> in order to get the output of where listed in 003-gdb-where.txt (also in the Gist above).

This takes a long time to reproduce, so I’ve left a hung process on a secondary computer with a gdb session attached. I can issue any commands you’d like, or I’m available for a screen sharing session (I have a MeetingBurner account we could use) if a maintainer would like to see what’s going on under the hood.

Update: 4/13 session with @erikh

erikh suggested that the hang may not be a problem with select at all, but rather sshd has sent something back to Net::SSH that is not being handled properly by Net::SSH, and the processing loop just jumps right back in to asking for more data. At that point sshd just remains silent, and because the SSH loop is using an infinite wait period, it hangs.

Util

To generate a large file quickly. WARNING! This will create a large file (≈4.1 GB; ≈4.4 GB according to Finder… which is wrong) on your hard drive. Make sure you have the space.

mkfile 4404802560 test-data.txt

This uses the same bytesize as my test file, which is kind of arbitrary. I haven’t tested for the lower boundary of the bug.

Edits

Updated large file creator to use mkfile because dd was tossing an invalid argument error when given /dev/zero on OS X
Updated Gist to use test-data.txt instead of tar
Updated Gist to use Net::SFTP#download instead of Net::SFTP#download! for cleaner stack trace
Eliminated include Net to keep the namespace as clean as possible (was vestigial from a code example)

About this issue

Original URL
State: closed
Created 11 years ago
Comments: 18

Commits related to this issue

Warn if it looks like we're going to hit https://github.com/net-ssh/net-sftp/issues/27. — committed to mfoemmel/fig by clonezone 10 years ago
Upgrade SFTP to v5.0.2. Hopefully, this fixes https://github.com/net-ssh/net-sftp/issues/27. — committed to fig-package-manager/fig by clonezone 6 years ago

Most upvoted comments

I’m closing the issue as i was able to reproduce the bug with 4.2.0 but not with 5.0.0.rc1. Please comment with details if you don’t agree.

mfazekas on May 24, 2018