curl_cffi: [BUG] Gets stuck forever in various scenarios with `stream=True`

Describe the bug

Running our project’s test suite against curl_cffi 0.5.10b2 with streaming in sync mode, we found many tests getting stuck.

From this I’ve put together some test cases covering various conditions where it gets stuck: https://gist.github.com/coletdjnz/034729c002e39fd97ba8199abf03dff4

One common issue appears to be that in the perform() function, errors raised by curl are not re-raised in the main thread: https://github.com/yifeikong/curl_cffi/blob/c0e6fa0df34ea3d0278597279bc2c0744cd70187/curl_cffi/requests/session.py#L596

So if curl errors on the request, the main thread gets stuck on

            # Wait for the first chunk
            header_recved.wait()  # type: ignore

since no data will ever arrive from curl.

The same goes for reading with iter_content: it gets stuck on queue.get(), as no data is ever received from curl.
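
A minimal sketch of one way to fix both hangs, assuming the session keeps the Future returned by ThreadPoolExecutor.submit(perform) (the helper names here are hypothetical, not curl_cffi API): wait with a timeout and re-check the background future, so an exception raised inside perform() reaches the main thread instead of blocking forever:

    import queue
    import threading
    from concurrent.futures import Future

    def wait_or_raise(event: threading.Event, future: Future, interval: float = 0.1) -> None:
        # Like event.wait(), but re-raises any exception from the
        # background perform() instead of blocking forever.
        while not event.wait(timeout=interval):
            if future.done():
                future.result()  # re-raises the perform() exception, if any
                return

    def get_or_raise(q: queue.Queue, future: Future, interval: float = 0.1):
        # Like q.get(), but gives up once perform() has exited and no
        # more data can arrive.
        while True:
            try:
                return q.get(timeout=interval)
            except queue.Empty:
                if future.done():
                    future.result()  # re-raises the perform() exception, if any
                    raise  # perform() finished cleanly but no more data is coming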

Additionally, check out test_multiple_session_req and test_multiple_session_req_read. For some reason, when making multiple subsequent requests with the same session, curl throws Failed to perform, ErrCode: 3, Reason: 'No URL set'. on these (hence they get stuck).

Note that when you run test_multiple_session_req_read, it sometimes does not get stuck and passes, suggesting there may be a race condition somewhere.
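
For reference, a condensed approximation of what those tests do (the real code is in the gist; the URL here is a stand-in):

    from curl_cffi import requests

    def test_multiple_session_req_read():
        s = requests.Session()
        for _ in range(2):
            r = s.get("https://example.com", stream=True)
            # The second iteration intermittently fails with
            # "Failed to perform, ErrCode: 3, Reason: 'No URL set'."
            # and then hangs on the internal queue.get().
            for _ in r.iter_content():
                pass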

Expected behavior

Errors raised by curl should propagate to the calling thread, and streaming requests should fail fast rather than hang indefinitely.

Versions

OS: Linux x64
curl_cffi: 0.5.10b2

cffi==1.15.1
curl-cffi==0.5.10b2
iniconfig==2.0.0
packaging==23.1
pluggy==1.2.0
pycparser==2.21
pytest==7.4.0

Additional context

Which session are you using, async or sync? Which loop implementation are you using?

Sync

About this issue

  • Original URL
  • State: closed
  • Created 9 months ago
  • Comments: 28 (14 by maintainers)

Most upvoted comments

If there are no other issues, I will release 0.5.10, and move on to new browser support.

Available as 0.5.10b4, except for macOS M1, for which I have to build the wheels manually due to the lack of a GitHub Actions instance.

You can also abort the transfer by returning CURL_WRITEFUNC_ERROR (added in 7.87.0), which makes CURLE_WRITE_ERROR get returned.
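
A sketch of that idea as a pycurl-style write callback; only the CURL_WRITEFUNC_ERROR constant comes from curl.h, the surrounding glue is illustrative:

    import io

    CURL_WRITEFUNC_ERROR = 0xFFFFFFFF  # from curl.h, libcurl >= 7.87.0

    buffer = io.BytesIO()
    aborted = False  # e.g. set when the response is closed early

    def write_callback(chunk: bytes) -> int:
        # Returning CURL_WRITEFUNC_ERROR makes the transfer fail with
        # CURLE_WRITE_ERROR, so perform() returns instead of downloading
        # the rest of the body.
        if aborted:
            return CURL_WRITEFUNC_ERROR
        buffer.write(chunk)
        return len(chunk)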

This goes back to the question of upgrading the curl version in #53; the current base version is 7.84, if I remember correctly.

Could this perhaps be used to pause the transfer when nothing is being read? (i.e., so it doesn’t download the entire response content if you don’t read it, or only read/iterate chunks in small amounts)

Pause seems promising.
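
Roughly what that could look like with a bounded queue; CURL_WRITEFUNC_PAUSE is real libcurl (the transfer is later resumed with curl_easy_pause(CURLPAUSE_CONT)), the rest is hypothetical glue:

    import queue

    CURL_WRITEFUNC_PAUSE = 0x10000001  # from curl.h

    chunks: "queue.Queue[bytes]" = queue.Queue(maxsize=16)

    def write_callback(chunk: bytes) -> int:
        try:
            chunks.put_nowait(chunk)
        except queue.Full:
            # libcurl keeps this chunk and re-delivers it once the
            # transfer is unpaused, so a slow consumer no longer forces
            # the whole body into memory.
            return CURL_WRITEFUNC_PAUSE
        return len(chunk)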

Yes, we should be able to do that: signal the perform thread to quit when close() is called.
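
For example, a hypothetical handle tying close() to the background perform() (names are illustrative, not the actual session internals):

    import threading
    from concurrent.futures import Future, ThreadPoolExecutor

    class StreamState:
        """Hypothetical glue between close() and the background perform()."""

        def __init__(self, executor: ThreadPoolExecutor, perform) -> None:
            self.quit = threading.Event()  # checked by the write callback
            self.future: Future = executor.submit(perform, self.quit)

        def close(self) -> None:
            # Tell the write callback to abort (e.g. by returning
            # CURL_WRITEFUNC_ERROR as sketched above), then wait for
            # perform() to actually return.
            self.quit.set()
            try:
                self.future.result()
            except Exception:
                pass  # the abort surfaces as CURLE_WRITE_ERROR; ignore on close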

You mean this?

>>> from curl_cffi import requests
>>> s = requests.Session()
>>> s.get("not valid", stream=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/yifei/repos/curl_cffi/curl_cffi/requests/session.py", line 650, in request
    raise first_element
curl_cffi.requests.errors.RequestsError: Failed to perform, ErrCode: 3, Reason: ''. This may be a libcurl error, See https://curl.se/libcurl/c/libcurl-errors.html first for more details.
>>> 
>>> from curl_cffi import CurlOpt
>>> s.curl.setopt(CurlOpt.MAXREDIRS, "three")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/yifei/repos/curl_cffi/curl_cffi/curl.py", line 139, in setopt
    c_value = ffi.new("int*", value)
TypeError: an integer is required

The problem lies in the fact that the two concurrent requests share the same curl instance. It’s not a big change to use separate curl instances for different ongoing requests, which is how it’s implemented in the AsyncSession.

However, another behavior worth noting is that the data starts to be buffered in memory as soon as you fire off the request. The current implementation works around the blocking perform() by submitting it to a ThreadPoolExecutor to run in a background thread and receiving queued chunks in the main thread.

Thus, if you create two streaming requests at once and only consume one of them, the other might be completely buffered in memory. This is unlike how requests/httpx behave, where you control the read rate, and is probably not what you expect.
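
An illustration of that pitfall (stand-in URLs; separate sessions are used to sidestep the shared-curl bug above):

    from curl_cffi import requests

    a = requests.Session().get("https://example.com/big-a", stream=True)
    b = requests.Session().get("https://example.com/big-b", stream=True)

    # Only `a` is consumed, but the background perform() behind `b`
    # keeps queueing chunks, so b's entire body may end up buffered in
    # memory before iter_content() is ever called on it.
    for chunk in a.iter_content():
        pass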

Anyway, if you find some libcurl API with which we can get rid of this turn-callback-into-iterator mess, I’m more than happy to change the implementation.