kedro: [KED-2639] Cannot read csv in chunks with pandas
Description
Cannot read csv in chunks with kedro data catalog.
```python
df = pd.read_csv(csv, chunksize=1000)
df.get_chunk()
```
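Outside kedro, this plain-pandas pattern works as expected. A minimal sketch, using an in-memory buffer and made-up sample data in place of a real CSV file:

```python
import io
import pandas as pd

# Made-up sample data standing in for the real CSV file.
csv_text = "a,b\n1,2\n3,4\n5,6\n"

# With chunksize, read_csv returns a lazy TextFileReader, not a DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
first = reader.get_chunk()  # pulls the first chunk of 2 rows
print(first.shape)
```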
Steps to Reproduce
```yaml
train_dataset:
  type: pandas.CSVDataSet
  filepath: 'mycsv.csv'
  load_args:
    chunksize: 50000
```

```python
df = catalog.load("train_dataset")
df.get_chunk()
```

```
ValueError: I/O operation on closed file.
df
<pandas.io.parsers.TextFileReader at 0x7fde97a82450>
```
Expected Result
I should be able to loop over the reader.
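For reference, looping over a chunked pandas reader normally looks like this; a minimal sketch with an in-memory buffer and made-up sample data standing in for the catalog-loaded file:

```python
import io
import pandas as pd

# Made-up sample data for illustration.
csv_text = "a,b\n1,2\n3,4\n5,6\n"

reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
# Iterating yields DataFrames of up to `chunksize` rows each.
sizes = [len(chunk) for chunk in reader]
print(sizes)  # 3 data rows with chunksize=2 -> chunks of 2 and 1
```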
Actual Result
ValueError: I/O operation on closed file.
## Your Environment
Include as many relevant details about the environment in which you experienced the bug:
* Kedro version used (`pip show kedro` or `kedro -V`): 0.16.6
* Python version used (`python -V`): 3.7.5
* Operating system and version: Ubuntu
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 2
- Comments: 18 (12 by maintainers)
For anyone who is looking for a hotfix: thanks to the dynamic nature of Python, we can fix it without touching the source code. Alternatively, you can create a custom dataset that inherits from `CSVDataSet` and simply override the `_load()` method.

I believe the problem here is that the context manager used in `catalog.load` for a CSV file closes the file: https://github.com/quantumblacklabs/kedro/blob/e17a5e44e6d1ec1335b4cb69011babd7f38cad9b/kedro/extras/datasets/pandas/csv_dataset.py#L157. Since pandas added `fsspec` support to its API starting with version 1.1.0, we are in the process of converting this code (and others like `JSONDataSet`) to use `pd.read_*` without the need for the context manager. This should fix the bug, but it won't be out until kedro 0.18. In the meantime, I think you should be able to fix it easily just by removing the context manager (I tried this out briefly and it seemed to work, but use at your own risk…).
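The failure mode, and the effect of removing the context manager, can be demonstrated with plain pandas and an in-memory buffer (no kedro involved; the function names and sample data below are made up for illustration):

```python
import io
import pandas as pd

# Made-up sample data standing in for the real CSV file.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))

def load_with_context_manager():
    # Mimics the old _load(): the `with` block closes the handle as soon
    # as the function returns, even though pandas reads chunks lazily.
    with io.StringIO(csv_text) as f:
        return pd.read_csv(f, chunksize=100)

def load_without_context_manager():
    # The suggested hotfix: drop the `with` block so the handle stays
    # open for the lazy reader.
    f = io.StringIO(csv_text)
    return pd.read_csv(f, chunksize=100)

try:
    for chunk in load_with_context_manager():
        pass  # consuming chunks eventually touches the closed handle
except ValueError as exc:
    print(f"ValueError: {exc}")  # e.g. "I/O operation on closed file"

reader = load_without_context_manager()
print(reader.get_chunk().shape)  # first chunk: 100 rows, 2 columns
```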
Note also that since pandas 1.2, `TextFileReader` (which is what is returned when specifying `chunksize`) is itself a context manager - see https://github.com/pandas-dev/pandas/pull/38225. It's still iterable, so correct usage would now be to iterate over the reader inside a `with` block.

@WaylonWalker Thanks for jumping in. I have read your blog about Kedro before; it helped me understand some concepts better.
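As noted in the thread, pandas 1.2 made `TextFileReader` a context manager; a minimal sketch of that usage (sample data made up for illustration):

```python
import io
import pandas as pd

# Made-up sample data for illustration.
csv_text = "a,b\n1,2\n3,4\n5,6\n"

sizes = []
# pandas >= 1.2: the reader itself manages the underlying handle,
# so iterate inside the `with` block.
with pd.read_csv(io.StringIO(csv_text), chunksize=2) as reader:
    for chunk in reader:
        sizes.append(len(chunk))
print(sizes)  # 3 data rows with chunksize=2 -> chunks of 2 and 1
```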
When I iterate over it, it throws an error saying the file is already closed.
I’m facing the same issue; does anyone have updates on this problem?
@WaylonWalker I did the same thing to check whether fsspec is the problem - it seems not to be. `catalog.load()` first calls fsspec, then it also calls the transformer; I suspect the transformer tries to read that generator and closes it. But I haven't dug deep into transformers yet, so it would be great if someone with more knowledge could jump in.
I was able to replicate it. I set up a pipeline with a CSV and a catalog entry just as you did, and I run into the same error if I try to `kedro run` or `catalog.load` it. I am not able to replicate the issue just loading with pandas, even if I use fsspec like `pandas.CSVDataSet` does. Someone with a deeper understanding of the internals may need to take a look. I posted my replica of the issue here: https://github.com/WaylonWalker/kedro_chunked.
That is awesome!!! and potentially motivating to keep making more content.