scrapy: S3 CsvItemExporter read of closed file error
Description
Unable to use batch_item_count with S3 to export feeds using the CsvItemExporter. Could this be related to #4830? Exporting the same feed to my local file system works, and so does changing the format from csv to json and exporting to S3.
Steps to Reproduce
- Using the feed config of
    "FEEDS": {
        "s3://bucket/%(name)s/%(batch_time)s.tsv": {
            "format": "csv",
            "batch_item_count": 10,
            "item_export_kwargs": {"delimiter": "\t"},
        }
    }
- Scrapy successfully parses data but has trouble exporting it to S3. However, I’m not sure if this is an issue with botocore or scrapy. I’m on botocore==1.20.29; I downgraded a few releases with no change in the issue.
Expected behavior: Every batch is exported to S3.
Actual behavior: Produces the error below. Only the last chunk that was processed is exported successfully.
Reproduces how often: 100%
Versions
Scrapy : 2.4.1
lxml : 4.6.2.0
libxml2 : 2.9.10
cssselect : 1.1.0
parsel : 1.6.0
w3lib : 1.22.0
Twisted : 21.2.0
Python : 3.8.5 (default, Aug 11 2020, 11:08:40) - [Clang 11.0.3 (clang-1103.0.32.62)]
pyOpenSSL : 20.0.1 (OpenSSL 1.1.1i 8 Dec 2020)
cryptography : 3.3.1
Platform : macOS-10.15.7-x86_64-i386-64bit
Additional context
2021-03-16 15:19:17 [scrapy.extensions.feedexport] ERROR: Error storing csv feed (10 items) in: s3://bucket/bucket-file.tsv
Traceback (most recent call last):
File ".venv/lib/python3.8/site-packages/botocore/httpsession.py", line 314, in send
urllib_response = conn.urlopen(
File ".venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 699, in urlopen
httplib_response = self._make_request(
File ".venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 394, in _make_request
conn.request(method, url, **httplib_request_kw)
File ".venv/lib/python3.8/site-packages/urllib3/connection.py", line 234, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File ".pyenv/versions/3.8.5/lib/python3.8/http/client.py", line 1255, in request
self._send_request(method, url, body, headers, encode_chunked)
File ".venv/lib/python3.8/site-packages/botocore/awsrequest.py", line 92, in _send_request
rval = super(AWSConnection, self)._send_request(
File ".pyenv/versions/3.8.5/lib/python3.8/http/client.py", line 1301, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File ".pyenv/versions/3.8.5/lib/python3.8/http/client.py", line 1250, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File ".venv/lib/python3.8/site-packages/botocore/awsrequest.py", line 127, in _send_output
self._handle_expect_response(message_body)
File ".venv/lib/python3.8/site-packages/botocore/awsrequest.py", line 170, in _handle_expect_response
self._send_message_body(message_body)
File ".venv/lib/python3.8/site-packages/botocore/awsrequest.py", line 197, in _send_message_body
self.send(message_body)
File ".venv/lib/python3.8/site-packages/botocore/awsrequest.py", line 204, in send
return super(AWSConnection, self).send(str)
File ".pyenv/versions/3.8.5/lib/python3.8/http/client.py", line 963, in send
datablock = data.read(self.blocksize)
File ".pyenv/versions/3.8.5/lib/python3.8/tempfile.py", line 474, in func_wrapper
return func(*args, **kwargs)
ValueError: read of closed file
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File ".venv/lib/python3.8/site-packages/twisted/python/threadpool.py", line 238, in inContext
result = inContext.theWork() # type: ignore[attr-defined]
File ".venv/lib/python3.8/site-packages/twisted/python/threadpool.py", line 254, in <lambda>
inContext.theWork = lambda: context.call( # type: ignore[attr-defined]
File ".venv/lib/python3.8/site-packages/twisted/python/context.py", line 118, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File ".venv/lib/python3.8/site-packages/twisted/python/context.py", line 83, in callWithContext
return func(*args, **kw)
File ".venv/lib/python3.8/site-packages/scrapy/extensions/feedexport.py", line 155, in _store_in_thread
self.s3_client.put_object(
File ".venv/lib/python3.8/site-packages/botocore/client.py", line 357, in _api_call
return self._make_api_call(operation_name, kwargs)
File ".venv/lib/python3.8/site-packages/botocore/client.py", line 662, in _make_api_call
http, parsed_response = self._make_request(
File ".venv/lib/python3.8/site-packages/botocore/client.py", line 682, in _make_request
return self._endpoint.make_request(operation_model, request_dict)
File ".venv/lib/python3.8/site-packages/botocore/endpoint.py", line 102, in make_request
return self._send_request(request_dict, operation_model)
File ".venv/lib/python3.8/site-packages/botocore/endpoint.py", line 136, in _send_request
while self._needs_retry(attempts, operation_model, request_dict,
File ".venv/lib/python3.8/site-packages/botocore/endpoint.py", line 253, in _needs_retry
responses = self._event_emitter.emit(
File ".venv/lib/python3.8/site-packages/botocore/hooks.py", line 356, in emit
return self._emitter.emit(aliased_event_name, **kwargs)
File ".venv/lib/python3.8/site-packages/botocore/hooks.py", line 228, in emit
return self._emit(event_name, kwargs)
File ".venv/lib/python3.8/site-packages/botocore/hooks.py", line 211, in _emit
response = handler(**kwargs)
File ".venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 183, in __call__
if self._checker(attempts, response, caught_exception):
File ".venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 250, in __call__
should_retry = self._should_retry(attempt_number, response,
File ".venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 269, in _should_retry
return self._checker(attempt_number, response, caught_exception)
File ".venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 316, in __call__
checker_response = checker(attempt_number, response,
File ".venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 222, in __call__
return self._check_caught_exception(
File ".venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 359, in _check_caught_exception
raise caught_exception
File ".venv/lib/python3.8/site-packages/botocore/endpoint.py", line 200, in _do_get_response
http_response = self._send(request)
File ".venv/lib/python3.8/site-packages/botocore/endpoint.py", line 269, in _send
return self.http_session.send(request)
File ".venv/lib/python3.8/site-packages/botocore/httpsession.py", line 359, in send
raise HTTPClientError(error=e)
botocore.exceptions.HTTPClientError: An HTTP Client raised an unhandled exception: read of closed file
About this issue
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 22 (12 by maintainers)
Commits related to this issue
- Merge pull request #5705 from srki24/issue5043-feed_export S3 CsvItemExporter read of closed file error #5043 — committed to scrapy/scrapy by wRAR a year ago
This is hopefully fixed now and I want to thank everyone who worked on investigating and fixing this!
This PR is in progress and fixes the issue. If this PR is still not merged, you can work around the problem by creating a custom CSV exporter and registering it in settings.py.
Let me see if I can get it to work on my end. I’ll take a look tomorrow or the next.
@jackblk I more or less moved away from the issue as priorities changed for me. @marlenachatzigrigoriou did a great job debugging most of it, but we never found the underlying issue. Not sure if she wants to pick it up again?
Hello! @mmitropoulou and I would like to contribute to this issue.
https://github.com/scrapy/scrapy/pull/4830 was really a documentation issue, so provided you are not using a custom exporter or something along those lines, this looks more like a bug in either the CSV exporter or the handling of batch exports, maybe specific to storages like S3, which first write to a temporary file and then upload that file.
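The temporary-file hand-off described here can be illustrated with the standard library alone. This is a sketch under the assumption that the exporter writes to the temporary file through an io.TextIOWrapper and that the wrapper is discarded before the upload reads the file. The helper name write_then_drop_wrapper is made up for the illustration and is not Scrapy code.

```python
import gc
import io
import tempfile


def write_then_drop_wrapper(detach_first):
    """Write one TSV row through a text wrapper, discard the wrapper,
    and return the underlying temporary file for a later "upload"."""
    # Stands in for the temporary file a storage backend uploads afterwards.
    raw = tempfile.NamedTemporaryFile(prefix="feed-")
    stream = io.TextIOWrapper(
        raw, encoding="utf-8", newline="", write_through=True
    )
    stream.write("a\tb\r\n")
    if detach_first:
        # Disconnect the wrapper so it can no longer close the file.
        stream.detach()
    # Dropping the last reference finalizes the wrapper; a wrapper that is
    # still attached closes the file it wraps at this point.
    del stream
    gc.collect()
    return raw


if __name__ == "__main__":
    for detach_first in (False, True):
        raw = write_then_drop_wrapper(detach_first)
        print("detached:", detach_first, "file closed:", raw.closed)
        raw.close()
```

If the file comes back closed, any later read fails in the same way as the ValueError in the traceback, which would also fit the observation that json export is unaffected if that exporter does not wrap the file.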