kedro: [KED-2639] Cannot read csv in chunks with pandas
Description
Cannot read csv in chunks with kedro data catalog.
```python
df = pd.read_csv(csv, chunksize=1000)
df.get_chunk()
```
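Outside kedro, this plain-pandas pattern works as expected. A minimal sketch, using an in-memory buffer and made-up sample data in place of a real CSV file:

```python
import io
import pandas as pd

# Made-up sample data standing in for the real CSV file.
csv_text = "a,b\n1,2\n3,4\n5,6\n"

# With chunksize, read_csv returns a lazy TextFileReader, not a DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
first = reader.get_chunk()  # pulls the first chunk of 2 rows
print(first.shape)
```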
Steps to Reproduce
```yaml
train_dataset:
  type: pandas.CSVDataSet
  filepath: 'mycsv.csv'
  load_args:
    chunksize: 50000
```

```python
df = catalog.load("train_dataset")
df.get_chunk()
```

```
ValueError: I/O operation on closed file.
df
<pandas.io.parsers.TextFileReader at 0x7fde97a82450>
```
Expected Result
I should be able to loop over the reader.
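For reference, looping over a chunked pandas reader normally looks like this; a minimal sketch with an in-memory buffer and made-up sample data standing in for the catalog-loaded file:

```python
import io
import pandas as pd

# Made-up sample data for illustration.
csv_text = "a,b\n1,2\n3,4\n5,6\n"

reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
# Iterating yields DataFrames of up to `chunksize` rows each.
sizes = [len(chunk) for chunk in reader]
print(sizes)  # 3 data rows with chunksize=2 -> chunks of 2 and 1
```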
Actual Result
ValueError: I/O operation on closed file.
## Your Environment
Include as many relevant details about the environment in which you experienced the bug:
* Kedro version used (`pip show kedro` or `kedro -V`): 0.16.6
* Python version used (`python -V`): 3.7.5
* Operating system and version: Ubuntu
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 2
- Comments: 18 (12 by maintainers)
For anyone who is looking for a hotfix: thanks to the dynamic nature of Python, we can fix it without touching the source code. Alternatively, you can create a custom dataset that inherits from `CSVDataSet` and simply override the `_load()` method.

I believe the problem here is that the context manager used in `catalog.load` for a CSV file closes the file: https://github.com/quantumblacklabs/kedro/blob/e17a5e44e6d1ec1335b4cb69011babd7f38cad9b/kedro/extras/datasets/pandas/csv_dataset.py#L157. Since pandas added `fsspec` support to its API starting with version 1.1.0, we are in the process of converting this code (and others like `JSONDataSet`) to use `pd.read_*` without the need for the context manager. This should fix the bug, but it won't be out until kedro 0.18. In the meantime, I think you should be able to fix it easily just by removing the context manager (I tried this out briefly and it seemed to work, but use at your own risk…).
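The failure mode, and the effect of removing the context manager, can be demonstrated with plain pandas and an in-memory buffer (no kedro involved; the function names and sample data below are made up for illustration):

```python
import io
import pandas as pd

# Made-up sample data standing in for the real CSV file.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))

def load_with_context_manager():
    # Mimics the old _load(): the `with` block closes the handle as soon
    # as the function returns, even though pandas reads chunks lazily.
    with io.StringIO(csv_text) as f:
        return pd.read_csv(f, chunksize=100)

def load_without_context_manager():
    # The suggested hotfix: drop the `with` block so the handle stays
    # open for the lazy reader.
    f = io.StringIO(csv_text)
    return pd.read_csv(f, chunksize=100)

try:
    for chunk in load_with_context_manager():
        pass  # consuming chunks eventually touches the closed handle
except ValueError as exc:
    print(f"ValueError: {exc}")  # e.g. "I/O operation on closed file"

reader = load_without_context_manager()
print(reader.get_chunk().shape)  # first chunk: 100 rows, 2 columns
```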
Note also that since pandas 1.2, `TextFileReader` (which is what is returned when specifying `chunksize`) is itself a context manager - see https://github.com/pandas-dev/pandas/pull/38225. It's still iterable, so correct usage would now be to iterate over the reader inside a `with` block.

@WaylonWalker Thanks for jumping in. I have read your blog about Kedro before; it helped me understand some concepts better.
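As noted in the thread, pandas 1.2 made `TextFileReader` a context manager; a minimal sketch of that usage (sample data made up for illustration):

```python
import io
import pandas as pd

# Made-up sample data for illustration.
csv_text = "a,b\n1,2\n3,4\n5,6\n"

sizes = []
# pandas >= 1.2: the reader itself manages the underlying handle,
# so iterate inside the `with` block.
with pd.read_csv(io.StringIO(csv_text), chunksize=2) as reader:
    for chunk in reader:
        sizes.append(len(chunk))
print(sizes)  # 3 data rows with chunksize=2 -> chunks of 2 and 1
```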
When I iterate over it, it throws an error saying the file is already closed.
I’m facing the same issue; does anyone have updates on this problem?
@WaylonWalker I did the same thing to check whether fsspec is the problem - it seems not to be. `catalog.load()` first calls fsspec, then it also calls the transformer; I suspect the transformer tries to read that generator and closes it. But I haven't dug deep into transformers yet, so it would be great if someone with more knowledge could jump in.
I was able to replicate it. I set up a pipeline with a CSV and a catalog entry just as you did, and I run into the same error if I try to `kedro run` or `catalog.load` it. I am not able to replicate the issue just loading with pandas, even if I use fsspec like `pandas.CSVDataSet` does. Someone with a deeper understanding of the internals may need to take a look. I posted my replica of the issue here: https://github.com/WaylonWalker/kedro_chunked.
That is awesome!!! and potentially motivating to keep making more content.