smart_open: Ranged-read HTTP header leaves the to-bytes value blank, causing S3 filesystems to read the full file length
Problem description
smart_open is effectively crashing SeaweedFS/S3 and Ceph/S3 filesystems when doing many small reads over large files (e.g. a 32k read on a 4GB file).
On a SeaweedFS/S3 filesystem (also demonstrated on Ceph/S3), using the code shown in the reproduce section below, I am reading a small segment of data (32k in this example) from a large file (4GB in this example). This simple request causes SeaweedFS to move the full file for just a 32k range read. It appears that this is expected behavior based on a reading of the protocol specifications. Notably, boto3 does not trigger the same behavior.
When I run the code and look at the HTTP traffic being generated we see the following GET request:
GET /hengenlab/CAF77/Neural_Data/highpass_750/Headstages_256_Channels_int16_2021-02-02_14-28-24.bin HTTP/1.1
Host: seaweed-filer.seaweedfs:8333
Accept-Encoding: identity
Range: bytes=0-
User-Agent: Boto3/1.24.59 Python/3.8.2 Linux/5.4.0-125-generic Botocore/1.27.59
X-Amz-Date: 20220921T204638Z
X-Amz-Content-SHA256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Authorization: AWS4-HMAC-SHA256 Credential=KL2PPBIZ4OYKR420C28D/20220921/us-east-1/s3/aws4_request, SignedHeaders=host;range;x-amz-content-sha256;x-amz-date, Signature=8009c4fdc85311977066c6988047a72658579e02c02b544fa8d48d8d8b9e8d57
amz-sdk-invocation-id: eace9bc1-a1d1-4244-84ee-1caa164bc294
amz-sdk-request: attempt=1
Notably, Range: bytes=0- is our culprit. My assumption is that smart_open intends to open the file for streaming and read data from the stream as dictated by calls to f.read(...).
When performing a ranged read with just boto3 code, the header looks like this:
GET /hengenlab/CAF77/Neural_Data/highpass_750/Headstages_256_Channels_int16_2021-02-02_14-28-24.bin?partNumber=1 HTTP/1.1
Host: seaweed-filer.seaweedfs:8333
Accept-Encoding: identity
Range: bytes=1-32768
User-Agent: Boto3/1.24.59 Python/3.8.2 Linux/5.4.0-125-generic Botocore/1.27.59
X-Amz-Date: 20220921T205137Z
X-Amz-Content-SHA256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Authorization: AWS4-HMAC-SHA256 Credential=KL2PPBIZ4OYKR420C28D/20220921/us-east-1/s3/aws4_request, SignedHeaders=host;range;x-amz-content-sha256;x-amz-date, Signature=5d3398396c217f4a87284479bc6bc947344c256a30552fe98b6057167d7143fb
amz-sdk-invocation-id: 489f165b-cb4d-4ca0-bc49-cc1e70618518
amz-sdk-request: attempt=1
Using boto3 does not cause any issue. With smart_open, the SeaweedFS/S3 filesystem interprets the lack of a to-bytes value as a request for the full file. It then moves the full (4 GB) data file from a volume server to an S3 server, where just 32k are passed to the end-user job. This has the effect of very quickly oversaturating the network.
The protocol specifications seem to agree that SeaweedFS/S3's behavior is the correct way to interpret this Range header. After all, how can the filesystem know that the user won't need to read the whole file given this header?
If the last-byte-pos value is absent, or if the value is greater than or equal to the current length of the entity-body, last-byte-pos is taken to be equal to one less than the current length of the entity-body in bytes.
https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35
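For concreteness, here is a minimal sketch (plain boto3, with a placeholder endpoint, bucket, and key) contrasting the open-ended range that smart_open currently produces with the bounded range that a small read actually needs:

import boto3

s3 = boto3.client('s3', endpoint_url='https://seaweed-filer.seaweedfs:8333')  # placeholder endpoint

# Open-ended range, equivalent to what smart_open sends today; the server has no way
# to know that only 32k will be consumed, so SeaweedFS queues up the whole object.
s3.get_object(Bucket='bucket', Key='path/to/4gb/file.bin', Range='bytes=0-')

# Bounded range; the server only needs to materialize the requested 32k.
s3.get_object(Bucket='bucket', Key='path/to/4gb/file.bin', Range='bytes=0-32767')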
Steps/code to reproduce the problem
This code triggers the full (4 GB) file transfer (internally, not to the end user application) for a small 32k ranged read.
import smart_open

while True:
    with smart_open.open('s3://bucket/path/to/4gb/file.bin', 'rb') as f:
        b = f.read(32768)
        print(f'Read {len(b)}')
This boto3 version of the code does not trigger the same issue:
import boto3

while True:
    obj = boto3.resource('s3', endpoint_url='https://swfs-s3.endpoint').Object('bucket', 'path/to/4gb/file.bin')
    stream = obj.get(Range='bytes=1-32768')['Body']
    res = stream.read()
    print(f'Read: {len(res)} bytes')
I'd like to open a discussion to determine how to properly interpret the S3 protocol and figure out whether an issue like this should be addressed by the filesystem(s) or changed in smart_open.
Versions
>>> print(platform.platform())
Linux-5.19.0-76051900-generic-x86_64-with-glibc2.17
>>> print("Python", sys.version)
Python 3.8.10 (default, Jun 4 2021, 15:09:15)
[GCC 7.5.0]
>>> print("smart_open", smart_open.__version__)
smart_open 6.0.0
About this issue
- Original URL
- State: open
- Created 2 years ago
- Reactions: 3
- Comments: 19 (4 by maintainers)
Commits related to this issue
- Rewrite of s3.Reader class to protect S3 servers from open range headers. See: https://github.com/RaRe-Technologies/smart_open/issues/725 — committed to davidparks21/smart_open by davidparks21 2 years ago
- Rewrite of s3.Reader class to protect S3 servers from open range headers (#725) — committed to davidparks21/smart_open by davidparks21 2 years ago
After sleeping on this, here's the solution I propose implementing (I'm working toward this and a pull request).
We have two use cases that are at odds here:
- Use case 1: The file is opened for streaming and many read calls will be performed. This is common if you open a file and pass it to many underlying libraries which were optimized for local IO access. With many small reads of the file, repeating the HTTP headers would be an unacceptable overhead; it would also translate into an explosion of IOPS, which would not be good for the S3 filesystem either.
- Use case 2: A single small ranged read of a large file. This use case is common in many applications such as machine learning, where a large binary data file contains many small samples. The file is opened, read once, and closed, often at a very high frequency. For example, my use case is to read up to 20 samples/s of 25 MB segments of a multi-GB file, up to 500 MB/s, to feed a single GPU, and I often run tens of GPUs. In this use case, the overhead of an unspecified byte range is a showstopper because it doesn't allow the S3 filesystem to optimize the internal queuing of the file.
These two use cases are fundamentally at odds, so any solution is a balancing act between them. I propose we remove defer_seek from transport_params, it's unnecessary (see comment below), and add a transport_params option called optimize_reads which takes one of three user-specified values: auto (default), optimize_for_streaming, and optimize_for_partial_reads. (A usage sketch of this parameter follows the task list below.)
Description of these modes:
- optimize_for_streaming: operates like smart_open does now; the byte-range header is left unbounded and multiple reads pull data from the open body using the current method of buffer-size | read-size over one open HTTP request.
- optimize_for_partial_reads: closes the body after each read and specifies the byte-range header optimally for each read request. A new HTTP connection + headers is sent for each read call.
- auto: the default balance between these two modes. For the first read call, optimize_for_partial_reads is used. This way the byte range is set appropriately for the single-read ML-style use case and the S3 filesystem is protected. For the second and subsequent calls, optimize_for_streaming is used; this satisfies the first use case where many small read calls are performed, and it only costs one extra connection + set of HTTP headers, which is a minimal cost to automatically satisfy the partial-reads use case.
Tasks necessary to implement this (assuming no one raises a concern here, I'm working on a pull request with these changes):
- seek will not open the body; it will only update the position.
- Remove defer_seek (deprecate and make it a no-op); it's unnecessary when seek doesn't open the body. There are 14 points in s3.py where self._content_length is referenced. Those 14 points need to all support a potential None value. This is not unreasonable.
- Add optimize_reads to transport_params as discussed above.
- Add a read call counter to Reader/_SeekableRawReader to support the auto option described above.
- Add logic to Reader/_SeekableRawReader that closes the body when optimize_for_partial_reads == True or (auto == True and read_call_counter > 1).
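To make the proposal concrete, here is a rough usage sketch of the proposed optimize_reads transport parameter. This is a sketch of the proposal only, not an existing smart_open API; the parameter name, its values, and the bucket/key are illustrative:

import smart_open

# Hypothetical: optimize_reads is the transport parameter proposed above (not yet in smart_open).
tp = {'optimize_reads': 'optimize_for_partial_reads'}
with smart_open.open('s3://bucket/path/to/4gb/file.bin', 'rb', transport_params=tp) as f:
    f.seek(1024)           # under the proposal, this only updates the position; no HTTP request
    chunk = f.read(32768)  # one bounded GET with Range: bytes=1024-33791, body closed afterwards
    print(f'Read {len(chunk)} bytes')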
I've been looking through the smart_open code today to find a solution to this problem. I notice that the range is set by default in Reader.__init__(), where a seek(0) is run. This seek doesn't have the to-bytes information, so the range header is set to bytes=from- at this point. There is an option in transport_params, defer_seek=True, which skips this seek.
If you skip that first seek, the first call to read() will end up setting the same range: self._open_body() is called without parameters, which has the same effect of setting bytes=from-.
I guess the most obvious solution then is to set the to-bytes in the call to self._open_body() and to use the defer_seek transport param. I've tested that this works and solved the problem in SeaweedFS.
However, I question why defer_seek defaults to False, or why it's there at all, if the seek operation is going to effectively crash at least 2 major S3 filesystems (Ceph and SeaweedFS tested). So I'd still like to discuss how best to proceed. I'm happy to put in a pull request for this.
Yes, that makes sense.
download_bufsize?
Sounds like a fair bit of work, have fun 😉
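To make the mechanics described a few comments above concrete, here is a minimal caller-side sketch of defer_seek (bucket/key are placeholders); as noted, it only defers the unbounded request rather than eliminating it:

import smart_open

# defer_seek is an existing smart_open transport parameter; it skips the seek(0) in
# Reader.__init__(), but the first read() still calls _open_body() without a to-bytes
# value, so the open-ended Range header is merely deferred, not avoided.
tp = {'defer_seek': True}
with smart_open.open('s3://bucket/path/to/4gb/file.bin', 'rb', transport_params=tp) as f:
    data = f.read(32768)  # in current smart_open this still issues Range: bytes=0-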
So I only succeeded in solving the problem for a single read less than the buffer size (~132k by default). There's a more fundamental issue at play.
smart_open is opening the file for streaming, which explicitly does NOT generate multiple HTTP requests. Instead, multiple calls to f.read(...) keep pulling data from the StreamingBody on the same open HTTP request. This has the obvious benefit of reducing network overhead (avoiding repeat HTTP headers). However, in that scenario there is no way for smart_open to set an appropriate range for the file at the time it's opened without the user explicitly restricting the file to a specific range. This has the negative side effect of forcing the filesystem to queue up the full file (SeaweedFS/S3) or a large chunk of it (Ceph/S3).
So we're damned if we do and damned if we don't.
One solution is to give the user the ability to specify a range in the open request. If they don't, we fall back on multiple HTTP requests, which can set the range correctly.
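As a rough illustration of the trade-off (endpoint, bucket, and key are placeholders), here is a sketch of the two request patterns: the single streaming GET that smart_open uses today versus one bounded GET per read:

import boto3

s3 = boto3.client('s3', endpoint_url='https://swfs-s3.endpoint')  # placeholder endpoint

# Streaming pattern: one open-ended GET, many reads from the same StreamingBody.
# Only one set of HTTP headers is sent, but the range has to be chosen up front.
body = s3.get_object(Bucket='bucket', Key='path/to/4gb/file.bin', Range='bytes=0-')['Body']
first = body.read(32768)   # pulled from the already-open response
second = body.read(32768)  # no new HTTP request

# Per-read pattern: every read is its own bounded GET, so the server never has to
# queue more than the requested bytes, at the cost of repeated headers and IOPS.
def ranged_read(bucket, key, start, size):
    rng = f'bytes={start}-{start + size - 1}'
    return s3.get_object(Bucket=bucket, Key=key, Range=rng)['Body'].read()

chunk = ranged_read('bucket', 'path/to/4gb/file.bin', 0, 32768)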
Another solution would be for the filesystem to only queue up what it needs as bytes are read. While we might be able to get one filesystem to implement this, ensuring all S3 filesystems operate that way is probably wishful thinking.
I can’t think of another way to handle this situation in a way that both protects the filesystem and keeps network overhead to a minimum. Can anyone else?