werkzeug: werkzeug.formparser is really slow with large binary uploads
When I perform a multipart/form-data upload of a large binary file in Flask, the upload very easily becomes CPU bound (with Python consuming 100% CPU) instead of I/O bound on any reasonably fast network connection.
A little bit of CPU profiling reveals that almost all CPU time during these uploads is spent in `werkzeug.formparser.MultiPartParser.parse_parts()`. The reason is that `parse_lines()` yields a lot of very small chunks, sometimes even just single bytes:
```python
# we have something in the buffer from the last iteration.
# this is usually a newline delimiter.
if buf:
    yield _cont, buf
    buf = b''
```
So `parse_parts()` goes through a lot of small iterations (more than 2 million for a 100 MB file), processing single "lines" and always writing just very short chunks or even single bytes into the output stream. This adds a lot of overhead, slowing down the whole process and making it CPU bound very quickly.
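As a rough illustration of that per-chunk overhead (a minimal micro-benchmark added here for illustration, not taken from the report), writing the same data into an in-memory stream one byte at a time is dramatically slower than writing it in large chunks:

```python
import io
import timeit

payload = b"x" * (1024 * 1024)  # 1 MiB of data, written two different ways

def tiny_writes():
    out = io.BytesIO()
    for i in range(len(payload)):
        out.write(payload[i:i + 1])  # one write call per byte

def chunked_writes():
    out = io.BytesIO()
    for i in range(0, len(payload), 64 * 1024):
        out.write(payload[i:i + 64 * 1024])  # one write call per 64 KiB

print("per-byte :", timeit.timeit(tiny_writes, number=3))
print("64 KiB   :", timeit.timeit(chunked_writes, number=3))
```

The multipart parser pays this kind of per-call cost millions of times for a large upload, on top of the generator overhead of yielding each tiny chunk.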
A quick test shows that a speed-up is very easily possible by first collecting the data in a `bytearray` in `parse_lines()` and only yielding that data back into `parse_parts()` when `self.buffer_size` is exceeded. Something like this:
```python
buf = b''
collect = bytearray()

for line in iterator:
    if not line:
        self.fail('unexpected end of stream')

    if line[:2] == b'--':
        terminator = line.rstrip()
        if terminator in (next_part, last_part):
            # yield remaining collected data
            if collect:
                yield _cont, collect
            break

    if transfer_encoding is not None:
        if transfer_encoding == 'base64':
            transfer_encoding = 'base64_codec'
        try:
            line = codecs.decode(line, transfer_encoding)
        except Exception:
            self.fail('could not decode transfer encoded chunk')

    # we have something in the buffer from the last iteration.
    # this is usually a newline delimiter.
    if buf:
        collect += buf
        buf = b''

    # If the line ends with windows CRLF we write everything except
    # the last two bytes. In all other cases however we write
    # everything except the last byte. If it was a newline, that's
    # fine, otherwise it does not matter because we will write it
    # the next iteration. this ensures we do not write the
    # final newline into the stream. That way we do not have to
    # truncate the stream. However we do have to make sure that
    # if something else than a newline is in there we write it
    # out.
    if line[-2:] == b'\r\n':
        buf = b'\r\n'
        cutoff = -2
    else:
        buf = line[-1:]
        cutoff = -1

    collect += line[:cutoff]

    if len(collect) >= self.buffer_size:
        yield _cont, collect
        collect.clear()
```
This change alone reduces the upload time for my 34 MB test file from 4200 ms to around 1100 ms over localhost on my machine, which is almost a 4x speed-up. All tests were done on Windows (64-bit Python 3.4); I'm not sure whether it's as much of a problem on Linux.
It’s still mostly CPU bound, so I’m sure there is even more potential for optimization. I think I’ll look into it when I find a bit more time.
About this issue
- State: closed
- Created 8 years ago
- Reactions: 6
- Comments: 33 (17 by maintainers)
Final summary now that our changes have landed in GitHub master. Small benchmark uploading a file of 64 MB of random data 10 times in a row and measuring the average request time on an Intel Core i7-8550U:
Before:
10 requests. Avg request time: 1390.9 ms. 46.0 MiB/s

After:
10 requests. Avg request time: 91.3 ms. 701.0 MiB/s

With reasonably large files that's a 15x improvement (the difference is a little smaller with small files because of the request overhead), and on a somewhat fast server CPU Werkzeug's multipart parser should now be able to saturate a gigabit ethernet link!
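A benchmark along these lines can be reproduced with a sketch like the following (the endpoint URL, field name, and harness details here are assumptions, not the exact script used for the numbers above):

```python
import os
import time

import requests  # assumes a local Flask app accepting uploads at /upload

payload = os.urandom(64 * 1024 * 1024)  # 64 MiB of random data

times = []
for _ in range(10):
    start = time.perf_counter()
    requests.post(
        "http://127.0.0.1:5000/upload",
        files={"file": ("data.bin", payload)},
    )
    times.append(time.perf_counter() - start)

avg = sum(times) / len(times)
print(f"10 requests. Avg request time: {avg * 1000:.1f} ms. "
      f"{len(payload) / avg / 2**20:.1f} MiB/s")
```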
I’m happy with the result. 😃
I wanted to mention parsing the stream in chunks as it is received. @siddhantgoel wrote this great little parser for us, and it's working great for me: https://github.com/siddhantgoel/streaming-form-data
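For reference, a minimal sketch of wiring streaming-form-data into a Flask view (the route, field name, chunk size, and target path here are illustrative assumptions, not taken from the thread):

```python
from flask import Flask, request
from streaming_form_data import StreamingFormDataParser
from streaming_form_data.targets import FileTarget

app = Flask(__name__)

@app.route("/upload", methods=["POST"])
def upload():
    # Parse the multipart body chunk by chunk as it arrives instead of
    # letting Werkzeug buffer and parse the whole request body at once.
    parser = StreamingFormDataParser(headers=request.headers)
    parser.register("file", FileTarget("/tmp/upload.bin"))  # illustrative path

    while True:
        chunk = request.stream.read(64 * 1024)
        if not chunk:
            break
        parser.data_received(chunk)

    return "ok"
```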
@davidism So I looked into your current implementation to check where it’s slow and I think it turns out that from here we can get another 10x speedup by adding less than 10 lines of code.
When uploading a large binary file, most of the time is spent in the `elif self.state == State.DATA` clause of `MultipartDecoder.next_event`, about half of it in `list(LINE_BREAK_RE.finditer(self.buffer))` and half in the remaining lines. But we don't really need to look at all the line breaks. The trick is to offload as much work as possible to `bytes.find()`, which is really fast.

When we execute `self.buffer.find(boundary_end)` and it reports that nothing has been found, we can be sure that the ending boundary is not in `self.buffer[:-len(boundary_end)]` and can return this data without looking at it any further. We only need to keep the last `len(boundary_end)` bytes of the buffer for the next iteration, in case the ending boundary sits on the border between two chunks. When uploading a large file, almost all iterations of the loop can return immediately after `self.buffer.find(boundary_end)`. Only when it actually looks like we have an ending boundary do we fall back to the code that checks for the line breaks with the regular expression.

If you want to test it yourself, add a precomputed `self.boundary_end` to `MultipartDecoder.__init__()` and then change the `elif self.state == State.DATA` clause of `MultipartDecoder.next_event` accordingly (the idea is sketched below). Everything after the `else` is the old code unchanged, and the `elif self.buffer.find(self.boundary_end) == -1` branch is the trick I described above. This change alone reduces the upload time of my 430 MB test from 7000 ms to 700 ms, a 10x speed-up!

The `if len(self.buffer) <= len(self.boundary_end)` check was needed so that we don't get an infinite loop; I'm not sure if it's correct.

What do you think?
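A minimal, self-contained sketch of the fast path described in this comment (an illustration of the trick, not the actual patch), assuming the buffer is a `bytearray` and `boundary_end` is the byte string being searched for; a `None` return means falling back to the old `LINE_BREAK_RE`-based handling:

```python
from typing import Optional


def boundary_fast_path(buffer: bytearray, boundary_end: bytes) -> Optional[bytes]:
    """Return body data that provably cannot contain the ending boundary.

    Returns None when the exact (regex-based) code has to run instead,
    either because too little data is buffered or because a boundary
    match is possible.
    """
    if len(buffer) <= len(boundary_end):
        # Too little data to rule out a boundary split across two chunks;
        # this also avoids the infinite loop mentioned above.
        return None
    if buffer.find(boundary_end) == -1:
        # boundary_end occurs nowhere in the buffer, so everything except
        # the trailing len(boundary_end) bytes is guaranteed body data.
        cut = len(buffer) - len(boundary_end)
        data = bytes(buffer[:cut])
        del buffer[:cut]  # keep the tail for the next iteration
        return data
    # A boundary might be present: fall back to the line-break handling.
    return None
```

In `next_event` this check would run before the existing line-break scan, so for a large upload almost every iteration returns data immediately from the `bytes.find()` call.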
Seconded. This speeds up file uploads to my Flask app by more than a factor of 10.
Author of the other library here. I’m more than happy to review proposals/patches in case someone wants to provide an extension so it can work better with Werkzeug.
@siddhantgoel Thanks a lot for your fix with streaming-form-data. I can finally upload gigabyte-sized files at good speed and without memory filling up!