werkzeug: werkzeug.formparser is really slow with large binary uploads
When I perform a multipart/form-data upload of a large binary file in Flask, the upload very easily becomes CPU bound (with Python consuming 100% CPU) instead of I/O bound on any reasonably fast network connection.
A little bit of CPU profiling reveals that almost all CPU time during these uploads is spent in `werkzeug.formparser.MultiPartParser.parse_parts()`. The reason is that `parse_lines()` yields a lot of very small chunks, sometimes even just single bytes:
```python
# we have something in the buffer from the last iteration.
# this is usually a newline delimiter.
if buf:
    yield _cont, buf
    buf = b''
```
So `parse_parts()` goes through a lot of small iterations (more than 2 million for a 100 MB file), processing single "lines" and always writing just very short chunks or even single bytes into the output stream. This adds a lot of overhead, slowing down the whole process and making it CPU bound very quickly.
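As a rough illustration of that per-chunk overhead (a minimal micro-benchmark added here for illustration, not taken from the report), writing the same data into an in-memory stream one byte at a time is dramatically slower than writing it in large chunks:

```python
import io
import timeit

payload = b"x" * (1024 * 1024)  # 1 MiB of data, written two different ways

def tiny_writes():
    out = io.BytesIO()
    for i in range(len(payload)):
        out.write(payload[i:i + 1])  # one write call per byte

def chunked_writes():
    out = io.BytesIO()
    for i in range(0, len(payload), 64 * 1024):
        out.write(payload[i:i + 64 * 1024])  # one write call per 64 KiB

print("per-byte :", timeit.timeit(tiny_writes, number=3))
print("64 KiB   :", timeit.timeit(chunked_writes, number=3))
```

The multipart parser pays this kind of per-call cost millions of times for a large upload, on top of the generator overhead of yielding each tiny chunk.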
A quick test shows that a speed-up is very easily possible by first collecting the data in a `bytearray` in `parse_lines()` and only yielding that data back into `parse_parts()` when `self.buffer_size` is exceeded. Something like this:
```python
buf = b''
collect = bytearray()

for line in iterator:
    if not line:
        self.fail('unexpected end of stream')

    if line[:2] == b'--':
        terminator = line.rstrip()
        if terminator in (next_part, last_part):
            # yield remaining collected data
            if collect:
                yield _cont, collect
            break

    if transfer_encoding is not None:
        if transfer_encoding == 'base64':
            transfer_encoding = 'base64_codec'
        try:
            line = codecs.decode(line, transfer_encoding)
        except Exception:
            self.fail('could not decode transfer encoded chunk')

    # we have something in the buffer from the last iteration.
    # this is usually a newline delimiter.
    if buf:
        collect += buf
        buf = b''

    # If the line ends with windows CRLF we write everything except
    # the last two bytes. In all other cases however we write
    # everything except the last byte. If it was a newline, that's
    # fine, otherwise it does not matter because we will write it
    # the next iteration. this ensures we do not write the
    # final newline into the stream. That way we do not have to
    # truncate the stream. However we do have to make sure that
    # if something else than a newline is in there we write it
    # out.
    if line[-2:] == b'\r\n':
        buf = b'\r\n'
        cutoff = -2
    else:
        buf = line[-1:]
        cutoff = -1

    collect += line[:cutoff]

    if len(collect) >= self.buffer_size:
        yield _cont, collect
        collect.clear()
```
This change alone reduces the upload time for my 34 MB test file from 4200 ms to around 1100 ms over localhost on my machine, which is almost a 4x speed-up. All tests were done on Windows (64-bit Python 3.4); I'm not sure whether it's as much of a problem on Linux.
It’s still mostly CPU bound, so I’m sure there is even more potential for optimization. I think I’ll look into it when I find a bit more time.
About this issue
- State: closed
- Created 8 years ago
- Reactions: 6
- Comments: 33 (17 by maintainers)
Final summary now that our changes have landed in GitHub master. Small benchmark uploading a file of 64 MB of random data 10 times in a row and measuring the average request time on an Intel Core i7-8550U:
Before:
10 requests. Avg request time: 1390.9 ms. 46.0 MiB/s

After:
10 requests. Avg request time: 91.3 ms. 701.0 MiB/s

With reasonably large files that's a 15x improvement (the difference is a little smaller with small files because of the request overhead), and on a somewhat fast server CPU Werkzeug's multipart parser should now be able to saturate a gigabit ethernet link!
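A benchmark along these lines can be reproduced with a sketch like the following (the endpoint URL, field name, and harness details here are assumptions, not the exact script used for the numbers above):

```python
import os
import time

import requests  # assumes a local Flask app accepting uploads at /upload

payload = os.urandom(64 * 1024 * 1024)  # 64 MiB of random data

times = []
for _ in range(10):
    start = time.perf_counter()
    requests.post(
        "http://127.0.0.1:5000/upload",
        files={"file": ("data.bin", payload)},
    )
    times.append(time.perf_counter() - start)

avg = sum(times) / len(times)
print(f"10 requests. Avg request time: {avg * 1000:.1f} ms. "
      f"{len(payload) / avg / 2**20:.1f} MiB/s")
```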
I’m happy with the result. 😃
I wanted to mention parsing the stream in chunks as it is received. @siddhantgoel wrote this great little parser for us, and it's working great for me: https://github.com/siddhantgoel/streaming-form-data
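For reference, a minimal sketch of wiring streaming-form-data into a Flask view (the route, field name, chunk size, and target path here are illustrative assumptions, not taken from the thread):

```python
from flask import Flask, request
from streaming_form_data import StreamingFormDataParser
from streaming_form_data.targets import FileTarget

app = Flask(__name__)

@app.route("/upload", methods=["POST"])
def upload():
    # Parse the multipart body chunk by chunk as it arrives instead of
    # letting Werkzeug buffer and parse the whole request body at once.
    parser = StreamingFormDataParser(headers=request.headers)
    parser.register("file", FileTarget("/tmp/upload.bin"))  # illustrative path

    while True:
        chunk = request.stream.read(64 * 1024)
        if not chunk:
            break
        parser.data_received(chunk)

    return "ok"
```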
@davidism So I looked into your current implementation to check where it’s slow and I think it turns out that from here we can get another 10x speedup by adding less than 10 lines of code.
When uploading a large binary file, most of the time is spent in the `elif self.state == State.DATA` clause of `MultipartDecoder.next_event`, about half of it in `list(LINE_BREAK_RE.finditer(self.buffer))` and half in the remaining lines. But we don't really need to look at all the line breaks. The trick is to offload as much work as possible to `bytes.find()`, which is really fast.

When we execute `self.buffer.find(boundary_end)` and it reports that nothing has been found, we can be sure that the ending boundary is not in `self.buffer[:-len(boundary_end)]` and can return this data without looking at it any further. We only need to keep the last `len(boundary_end)` bytes of the buffer for the next iteration, in case the ending boundary sits on the border between two chunks. When uploading a large file, almost all iterations of the loop can return immediately after `self.buffer.find(boundary_end)`. Only when it actually looks like we have an ending boundary do we fall back to the code that checks for the line breaks with the regular expression.

If you want to test it yourself, add a precomputed `self.boundary_end` to `MultipartDecoder.__init__()` and then change the `elif self.state == State.DATA` clause of `MultipartDecoder.next_event` accordingly (the idea is sketched below). Everything after the `else` is the old code unchanged, and the `elif self.buffer.find(self.boundary_end) == -1` branch is the trick I described above. This change alone reduces the upload time of my 430 MB test from 7000 ms to 700 ms, a 10x speed-up!

The `if len(self.buffer) <= len(self.boundary_end)` check was needed so that we don't get an infinite loop; I'm not sure if it's correct.

What do you think?
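A minimal, self-contained sketch of the fast path described in this comment (an illustration of the trick, not the actual patch), assuming the buffer is a `bytearray` and `boundary_end` is the byte string being searched for; a `None` return means falling back to the old `LINE_BREAK_RE`-based handling:

```python
from typing import Optional


def boundary_fast_path(buffer: bytearray, boundary_end: bytes) -> Optional[bytes]:
    """Return body data that provably cannot contain the ending boundary.

    Returns None when the exact (regex-based) code has to run instead,
    either because too little data is buffered or because a boundary
    match is possible.
    """
    if len(buffer) <= len(boundary_end):
        # Too little data to rule out a boundary split across two chunks;
        # this also avoids the infinite loop mentioned above.
        return None
    if buffer.find(boundary_end) == -1:
        # boundary_end occurs nowhere in the buffer, so everything except
        # the trailing len(boundary_end) bytes is guaranteed body data.
        cut = len(buffer) - len(boundary_end)
        data = bytes(buffer[:cut])
        del buffer[:cut]  # keep the tail for the next iteration
        return data
    # A boundary might be present: fall back to the line-break handling.
    return None
```

In `next_event` this check would run before the existing line-break scan, so for a large upload almost every iteration returns data immediately from the `bytes.find()` call.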
Seconded. This speeds up file uploads to my Flask app by more than a factor of 10.
Author of the other library here. I’m more than happy to review proposals/patches in case someone wants to provide an extension so it can work better with Werkzeug.
@siddhantgoel Thanks a lot for your fix with streaming-form-data. I can finally upload gigabyte-sized files at good speed and without memory filling up!