smart_open: Ranged-read HTTP header leaves the to-bytes value blank, causing S3 filesystems to read the full file length
Problem description
smart_open is effectively crashing SeaweedFS/S3 and Ceph/S3 filesystems when doing many small reads over large files (e.g. a 32k read on a 4GB file).
On a SeaweedFS/S3 filesystem (also demonstrated on Ceph/S3), using the code shown in the reproduce section below, I am reading a small segment of data (32k in this example) from a large file (4GB in this example). This simple request causes SeaweedFS to move the full file for just a 32k range read. It appears that this is expected behavior based on a reading of the protocol specifications. Notably, boto3 does not trigger the same behavior.
When I run the code and look at the HTTP traffic being generated we see the following GET request:
GET /hengenlab/CAF77/Neural_Data/highpass_750/Headstages_256_Channels_int16_2021-02-02_14-28-24.bin HTTP/1.1
Host: seaweed-filer.seaweedfs:8333
Accept-Encoding: identity
Range: bytes=0-
User-Agent: Boto3/1.24.59 Python/3.8.2 Linux/5.4.0-125-generic Botocore/1.27.59
X-Amz-Date: 20220921T204638Z
X-Amz-Content-SHA256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Authorization: AWS4-HMAC-SHA256 Credential=KL2PPBIZ4OYKR420C28D/20220921/us-east-1/s3/aws4_request, SignedHeaders=host;range;x-amz-content-sha256;x-amz-date, Signature=8009c4fdc85311977066c6988047a72658579e02c02b544fa8d48d8d8b9e8d57
amz-sdk-invocation-id: eace9bc1-a1d1-4244-84ee-1caa164bc294
amz-sdk-request: attempt=1
Notably, Range: bytes=0- is our culprit. My assumption is that smart_open intends to open the file for streaming and read data from the stream as dictated by calls to f.read(...).
When performing a ranged read with just boto3 code, the header looks like this:
GET /hengenlab/CAF77/Neural_Data/highpass_750/Headstages_256_Channels_int16_2021-02-02_14-28-24.bin?partNumber=1 HTTP/1.1
Host: seaweed-filer.seaweedfs:8333
Accept-Encoding: identity
Range: bytes=1-32768
User-Agent: Boto3/1.24.59 Python/3.8.2 Linux/5.4.0-125-generic Botocore/1.27.59
X-Amz-Date: 20220921T205137Z
X-Amz-Content-SHA256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Authorization: AWS4-HMAC-SHA256 Credential=KL2PPBIZ4OYKR420C28D/20220921/us-east-1/s3/aws4_request, SignedHeaders=host;range;x-amz-content-sha256;x-amz-date, Signature=5d3398396c217f4a87284479bc6bc947344c256a30552fe98b6057167d7143fb
amz-sdk-invocation-id: 489f165b-cb4d-4ca0-bc49-cc1e70618518
amz-sdk-request: attempt=1
Using boto3 does not cause any issue. With smart_open, the SeaweedFS/S3 filesystem interprets the lack of a to-bytes value as a request for the full file. It then moves the full (4 GB) data file from a volume server to an S3 server, where just 32k are passed to the end-user job. This has the effect of very quickly oversaturating the network.
The protocol specifications seem to agree that SeaweedFS/S3's behavior is the correct way to interpret this Range header. After all, how can the filesystem know that the user won't need to read the whole file given this header?
If the last-byte-pos value is absent, or if the value is greater than or equal to the current length of the entity-body, last-byte-pos is taken to be equal to one less than the current length of the entity-body in bytes.
https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35
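For concreteness, here is a minimal sketch (plain boto3, with a placeholder endpoint, bucket, and key) contrasting the open-ended range that smart_open currently produces with the bounded range that a small read actually needs:

import boto3

s3 = boto3.client('s3', endpoint_url='https://seaweed-filer.seaweedfs:8333')  # placeholder endpoint

# Open-ended range, equivalent to what smart_open sends today; the server has no way
# to know that only 32k will be consumed, so SeaweedFS queues up the whole object.
s3.get_object(Bucket='bucket', Key='path/to/4gb/file.bin', Range='bytes=0-')

# Bounded range; the server only needs to materialize the requested 32k.
s3.get_object(Bucket='bucket', Key='path/to/4gb/file.bin', Range='bytes=0-32767')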
Steps/code to reproduce the problem
This code triggers the full (4 GB) file transfer (internally, not to the end user application) for a small 32k ranged read.
import smart_open

while True:
    with smart_open.open('s3://bucket/path/to/4gb/file.bin', 'rb') as f:
        b = f.read(32768)
        print(f'Read {len(b)}')
This boto3 version of the code does not trigger the same issue:
import boto3

while True:
    obj = boto3.resource('s3', endpoint_url='https://swfs-s3.endpoint').Object('bucket', 'path/to/4gb/file.bin')
    stream = obj.get(Range='bytes=1-32768')['Body']
    res = stream.read()
    print(f'Read: {len(res)} bytes')
I'd like to open a discussion to determine how to properly interpret the S3 protocol and figure out whether an issue like this should be addressed by the filesystem(s) or changed in smart_open.
Versions
>>> print(platform.platform())
Linux-5.19.0-76051900-generic-x86_64-with-glibc2.17
>>> print("Python", sys.version)
Python 3.8.10 (default, Jun 4 2021, 15:09:15)
[GCC 7.5.0]
>>> print("smart_open", smart_open.__version__)
smart_open 6.0.0
About this issue
- Original URL
- State: open
- Created 2 years ago
- Reactions: 3
- Comments: 19 (4 by maintainers)
Commits related to this issue
- Rewrite of s3.Reader class to protect S3 servers from open range headers. See: https://github.com/RaRe-Technologies/smart_open/issues/725 — committed to davidparks21/smart_open by davidparks21 2 years ago
- Rewrite of s3.Reader class to protect S3 servers from open range headers (#725) — committed to davidparks21/smart_open by davidparks21 2 years ago
After sleeping on this, here's the solution I propose implementing (I'm working toward this and a pull request).
We have two use cases that are at odds here:
- Use case 1: The file is opened for streaming and many read calls will be performed. This is common if you open a file and pass it to many underlying libraries which were optimized for local IO access. With many small reads of the file, repeating the HTTP headers would be an unacceptable overhead; it would also translate into an explosion of IOPS, which would not be good for the S3 filesystem either.
- Use case 2: A single small ranged read of a large file. This use case is common in many applications such as machine learning, where a large binary data file contains many small samples. The file is opened, read once, and closed, often at a very high frequency. For example, my use case is to read up to 20 samples/s of 25 MB segments of a multi-GB file, up to 500 MB/s, to feed a single GPU, and I often run tens of GPUs. In this use case, the overhead of an unspecified byte range is a showstopper because it doesn't allow the S3 filesystem to optimize the internal queuing of the file.
These two use cases are fundamentally at odds, so any solution is a balancing act between them. I propose we remove defer_seek from transport_params, it's unnecessary (see comment below), and add a transport_params option called optimize_reads which takes one of three user-specified values: auto (default), optimize_for_streaming, and optimize_for_partial_reads. (A usage sketch of this parameter follows the task list below.)
Description of these modes:
- optimize_for_streaming: operates like smart_open does now; the byte-range header is left unbounded and multiple reads pull data from the open body using the current method of buffer-size | read-size over one open HTTP request.
- optimize_for_partial_reads: closes the body after each read and specifies the byte-range header optimally for each read request. A new HTTP connection + headers is sent for each read call.
- auto: the default balance between these two modes. For the first read call, optimize_for_partial_reads is used. This way the byte range is set appropriately for the single-read ML-style use case and the S3 filesystem is protected. For the second and subsequent calls, optimize_for_streaming is used; this satisfies the first use case where many small read calls are performed, and it only costs one extra connection + set of HTTP headers, which is a minimal cost to automatically satisfy the partial-reads use case.
Tasks necessary to implement this (assuming no one raises a concern here, I'm working on a pull request with these changes):
- seek will not open the body; it will only update the position.
- Remove defer_seek (deprecate and make it a no-op); it's unnecessary when seek doesn't open the body. There are 14 points in s3.py where self._content_length is referenced. Those 14 points need to all support a potential None value. This is not unreasonable.
- Add optimize_reads to transport_params as discussed above.
- Add a read call counter to Reader/_SeekableRawReader to support the auto option described above.
- Add logic to Reader/_SeekableRawReader that closes the body when optimize_for_partial_reads == True or (auto == True and read_call_counter > 1).
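To make the proposal concrete, here is a rough usage sketch of the proposed optimize_reads transport parameter. This is a sketch of the proposal only, not an existing smart_open API; the parameter name, its values, and the bucket/key are illustrative:

import smart_open

# Hypothetical: optimize_reads is the transport parameter proposed above (not yet in smart_open).
tp = {'optimize_reads': 'optimize_for_partial_reads'}
with smart_open.open('s3://bucket/path/to/4gb/file.bin', 'rb', transport_params=tp) as f:
    f.seek(1024)           # under the proposal, this only updates the position; no HTTP request
    chunk = f.read(32768)  # one bounded GET with Range: bytes=1024-33791, body closed afterwards
    print(f'Read {len(chunk)} bytes')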
I've been looking through the smart_open code today to find a solution to this problem. I notice that the range is set by default in Reader.__init__(), where a seek(0) is run. This seek doesn't have the to-bytes information, so the range header is set to bytes=from- at this point. There is an option in transport_params, defer_seek=True, which skips this seek.
If you skip that first seek, the first call to read() will end up setting the same range: self._open_body() is called without parameters, which has the same effect of setting bytes=from-.
I guess the most obvious solution then is to set the to-bytes in the call to self._open_body() and to use the defer_seek transport param. I've tested that this works and solved the problem in SeaweedFS.
However, I question why defer_seek defaults to False, or why it's there at all, if the seek operation is going to effectively crash at least 2 major S3 filesystems (Ceph and SeaweedFS tested). So I'd still like to discuss how best to proceed. I'm happy to put in a pull request for this.
Yes, that makes sense.
download_bufsize?
Sounds like a fair bit of work, have fun 😉
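To make the mechanics described a few comments above concrete, here is a minimal caller-side sketch of defer_seek (bucket/key are placeholders); as noted, it only defers the unbounded request rather than eliminating it:

import smart_open

# defer_seek is an existing smart_open transport parameter; it skips the seek(0) in
# Reader.__init__(), but the first read() still calls _open_body() without a to-bytes
# value, so the open-ended Range header is merely deferred, not avoided.
tp = {'defer_seek': True}
with smart_open.open('s3://bucket/path/to/4gb/file.bin', 'rb', transport_params=tp) as f:
    data = f.read(32768)  # in current smart_open this still issues Range: bytes=0-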
So I only succeeded in solving the problem for a single read less than the buffer size (~132k by default). There's a more fundamental issue at play.
smart_open is opening the file for streaming, which explicitly does NOT generate multiple HTTP requests. Instead, multiple calls to f.read(...) keep pulling data from the StreamingBody on the same open HTTP request. This has the obvious benefit of reducing network overhead (avoiding repeat HTTP headers). However, in that scenario there is no way for smart_open to set an appropriate range for the file at the time it's opened without the user explicitly restricting the file to a specific range. This has the negative side effect of forcing the filesystem to queue up the full file (SeaweedFS/S3) or a large chunk of it (Ceph/S3).
So we're damned if we do and damned if we don't.
One solution is to give the user the ability to specify a range in the open request. If they don't, we fall back on multiple HTTP requests, which can set the range correctly.
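As a rough illustration of the trade-off (endpoint, bucket, and key are placeholders), here is a sketch of the two request patterns: the single streaming GET that smart_open uses today versus one bounded GET per read:

import boto3

s3 = boto3.client('s3', endpoint_url='https://swfs-s3.endpoint')  # placeholder endpoint

# Streaming pattern: one open-ended GET, many reads from the same StreamingBody.
# Only one set of HTTP headers is sent, but the range has to be chosen up front.
body = s3.get_object(Bucket='bucket', Key='path/to/4gb/file.bin', Range='bytes=0-')['Body']
first = body.read(32768)   # pulled from the already-open response
second = body.read(32768)  # no new HTTP request

# Per-read pattern: every read is its own bounded GET, so the server never has to
# queue more than the requested bytes, at the cost of repeated headers and IOPS.
def ranged_read(bucket, key, start, size):
    rng = f'bytes={start}-{start + size - 1}'
    return s3.get_object(Bucket=bucket, Key=key, Range=rng)['Body'].read()

chunk = ranged_read('bucket', 'path/to/4gb/file.bin', 0, 32768)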
Another solution would be for the filesystem to only queue up what it needs as bytes are read. While we might be able to get one filesystem to implement this, ensuring all S3 filesystems operate that way is probably wishful thinking.
I can’t think of another way to handle this situation in a way that both protects the filesystem and keeps network overhead to a minimum. Can anyone else?