scrapy: S3 CsvItemExporter read of closed file error
Description
Unable to use batch_item_count with S3 to export feeds using the CsvItemExporter. Could this be related to #4830? I’ve tried and succeeded in exporting the same feed to my local system, and also in exporting to S3 after changing the format from csv to json.
Steps to Reproduce
- Using the feed config of
"FEEDS": {
    "s3://bucket/%(name)s/%(batch_time)s.tsv": {
        "format": "csv",
        "batch_item_count": 10,
        "item_export_kwargs": {"delimiter": "\t"},
    }
}
- Scrapy successfully parses the data but has trouble exporting it to S3. However, I’m not sure whether this is an issue with botocore or scrapy. I’m on botocore==1.20.29; I downgraded a few releases with no change in the issue.
Expected behavior: Every batch is exported to S3.
Actual behavior: Produces the error below. Only the last chunk that was processed is exported successfully.
Reproduces how often: 100%
Versions
Scrapy       : 2.4.1
lxml         : 4.6.2.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 21.2.0
Python       : 3.8.5 (default, Aug 11 2020, 11:08:40) - [Clang 11.0.3 (clang-1103.0.32.62)]
pyOpenSSL    : 20.0.1 (OpenSSL 1.1.1i  8 Dec 2020)
cryptography : 3.3.1
Platform     : macOS-10.15.7-x86_64-i386-64bit
Additional context
2021-03-16 15:19:17 [scrapy.extensions.feedexport] ERROR: Error storing csv feed (10 items) in: s3://bucket/bucket-file.tsv
Traceback (most recent call last):
  File ".venv/lib/python3.8/site-packages/botocore/httpsession.py", line 314, in send
    urllib_response = conn.urlopen(
  File ".venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File ".venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File ".venv/lib/python3.8/site-packages/urllib3/connection.py", line 234, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File ".pyenv/versions/3.8.5/lib/python3.8/http/client.py", line 1255, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File ".venv/lib/python3.8/site-packages/botocore/awsrequest.py", line 92, in _send_request
    rval = super(AWSConnection, self)._send_request(
  File ".pyenv/versions/3.8.5/lib/python3.8/http/client.py", line 1301, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File ".pyenv/versions/3.8.5/lib/python3.8/http/client.py", line 1250, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File ".venv/lib/python3.8/site-packages/botocore/awsrequest.py", line 127, in _send_output
    self._handle_expect_response(message_body)
  File ".venv/lib/python3.8/site-packages/botocore/awsrequest.py", line 170, in _handle_expect_response
    self._send_message_body(message_body)
  File ".venv/lib/python3.8/site-packages/botocore/awsrequest.py", line 197, in _send_message_body
    self.send(message_body)
  File ".venv/lib/python3.8/site-packages/botocore/awsrequest.py", line 204, in send
    return super(AWSConnection, self).send(str)
  File ".pyenv/versions/3.8.5/lib/python3.8/http/client.py", line 963, in send
    datablock = data.read(self.blocksize)
  File ".pyenv/versions/3.8.5/lib/python3.8/tempfile.py", line 474, in func_wrapper
    return func(*args, **kwargs)
ValueError: read of closed file
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File ".venv/lib/python3.8/site-packages/twisted/python/threadpool.py", line 238, in inContext
    result = inContext.theWork()  # type: ignore[attr-defined]
  File ".venv/lib/python3.8/site-packages/twisted/python/threadpool.py", line 254, in <lambda>
    inContext.theWork = lambda: context.call(  # type: ignore[attr-defined]
  File ".venv/lib/python3.8/site-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File ".venv/lib/python3.8/site-packages/twisted/python/context.py", line 83, in callWithContext
    return func(*args, **kw)
  File ".venv/lib/python3.8/site-packages/scrapy/extensions/feedexport.py", line 155, in _store_in_thread
    self.s3_client.put_object(
  File ".venv/lib/python3.8/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File ".venv/lib/python3.8/site-packages/botocore/client.py", line 662, in _make_api_call
    http, parsed_response = self._make_request(
  File ".venv/lib/python3.8/site-packages/botocore/client.py", line 682, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File ".venv/lib/python3.8/site-packages/botocore/endpoint.py", line 102, in make_request
    return self._send_request(request_dict, operation_model)
  File ".venv/lib/python3.8/site-packages/botocore/endpoint.py", line 136, in _send_request
    while self._needs_retry(attempts, operation_model, request_dict,
  File ".venv/lib/python3.8/site-packages/botocore/endpoint.py", line 253, in _needs_retry
    responses = self._event_emitter.emit(
  File ".venv/lib/python3.8/site-packages/botocore/hooks.py", line 356, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File ".venv/lib/python3.8/site-packages/botocore/hooks.py", line 228, in emit
    return self._emit(event_name, kwargs)
  File ".venv/lib/python3.8/site-packages/botocore/hooks.py", line 211, in _emit
    response = handler(**kwargs)
  File ".venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 183, in __call__
    if self._checker(attempts, response, caught_exception):
  File ".venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 250, in __call__
    should_retry = self._should_retry(attempt_number, response,
  File ".venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 269, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File ".venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 316, in __call__
    checker_response = checker(attempt_number, response,
  File ".venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 222, in __call__
    return self._check_caught_exception(
  File ".venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 359, in _check_caught_exception
    raise caught_exception
  File ".venv/lib/python3.8/site-packages/botocore/endpoint.py", line 200, in _do_get_response
    http_response = self._send(request)
  File ".venv/lib/python3.8/site-packages/botocore/endpoint.py", line 269, in _send
    return self.http_session.send(request)
  File ".venv/lib/python3.8/site-packages/botocore/httpsession.py", line 359, in send
    raise HTTPClientError(error=e)
botocore.exceptions.HTTPClientError: An HTTP Client raised an unhandled exception: read of closed file
About this issue
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 22 (12 by maintainers)
Commits related to this issue
- Merge pull request #5705 from srki24/issue5043-feed_export S3 CsvItemExporter read of closed file error #5043 — committed to scrapy/scrapy by wRAR a year ago
This is hopefully fixed now and I want to thank everyone who worked on investigating and fixing this!
This PR is in progress and fixes the issue. If the PR is still not merged, you can work around the problem by creating a custom CSV exporter and registering it in settings.py.
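The commenter's exporter code is not preserved in this thread. For the settings.py half, Scrapy's FEED_EXPORTERS setting maps a format name to an exporter class path, so registering a replacement for the csv format might look like the sketch below. The module path myproject.exporters and the class name FixedCsvItemExporter are placeholders, not names from the original comment.

```python
# settings.py
# Hypothetical path: point the "csv" format at your own exporter class.
# "myproject.exporters.FixedCsvItemExporter" is a placeholder name.
FEED_EXPORTERS = {
    "csv": "myproject.exporters.FixedCsvItemExporter",
}
```

With this in place, a feed whose "format" is "csv" would be written by the custom class instead of the built-in CsvItemExporter.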
Let me see if I can get it to work on my end. I’ll take a look tomorrow or the day after.
@jackblk I more or less moved away from the issue as priorities changed for me. @marlenachatzigrigoriou did a great job debugging most of it, but we never found the underlying issue. Not sure if she wants to pick it up again?
Hello! @mmitropoulou and I would like to contribute to this issue.
https://github.com/scrapy/scrapy/pull/4830 was really a documentation issue, so provided you are not using a custom exporter or something along those lines, this looks more like a bug in either the CSV exporter or the handling of batch exports, maybe specific to storages like S3, which first write to a temporary file and then upload that file.
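To illustrate one way a "write to a temporary file, then upload it" flow can hit "read of closed file", here is a stdlib-only sketch; it is not Scrapy's code. It assumes the CSV writer reaches the binary temporary file through an io.TextIOWrapper, and shows that dropping such a wrapper closes the file underneath it unless the wrapper is detached first. The helper names write_batch and file_closed_after_write are made up for this example, and whether this is the mechanism behind the traceback above is an assumption, not something this thread confirms.

```python
import csv
import gc
import io
import tempfile


def write_batch(binary_file, rows, detach):
    """Write rows as TSV through a text wrapper around a binary file."""
    wrapper = io.TextIOWrapper(
        binary_file, encoding="utf-8", newline="", write_through=True
    )
    writer = csv.writer(wrapper, delimiter="\t")
    writer.writerows(rows)
    wrapper.flush()
    if detach:
        # Release the binary file so the wrapper cannot close it later.
        wrapper.detach()
    # The wrapper is dropped when this function returns.


def file_closed_after_write(detach):
    """Report whether the temporary file was closed by the time of 'upload'."""
    binary_file = tempfile.TemporaryFile()
    write_batch(binary_file, [["id", "name"], ["1", "foo"]], detach)
    gc.collect()  # make sure the dropped wrapper has been finalized
    closed = binary_file.closed
    if not closed:
        binary_file.close()
    return closed


closed_without_detach = file_closed_after_write(detach=False)
closed_with_detach = file_closed_after_write(detach=True)
print(closed_without_detach, closed_with_detach)
```

If this is what happens during batch exports, an uploader thread that still holds the temporary file would fail on its next read once the wrapper has been discarded, while a custom exporter that detaches its text wrapper when it finishes exporting would leave the file open for the upload.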