kedro: [KED-1473] `pandas.CSVDataSet` doesn't support `encoding` parameter

Description

I’m unable to load a non-utf-8-encoded file.

I know it doesn’t work because of https://github.com/quantumblacklabs/kedro/blob/f03226e29b8a018a0f6edab6d3f1a0d37c1b1812/kedro/extras/datasets/pandas/csv_dataset.py#L154-L155. For it to work, `encoding` would have to be passed to `open`. However, some file systems don’t support an `encoding` parameter (e.g. gcsfs, I think).

Steps to Reproduce

Try loading https://github.com/beoutbreakprepared/nCoV2019/blob/433628fb828f3b3b3bff7d13195af357fe42e31d/ncov_outside_hubei.csv as a CSVDataSet.

Expected Result

I can load a cp1252-encoded file directly with pandas:

pd.read_csv("data/01_raw/nCoV2019/ncov_outside_hubei/20200304/ncov_outside_hubei.csv", encoding="cp1252")

Actual Result

I’m unable to load a cp1252-encoded file using Kedro:

DataSetError: Failed while loading data from data set CSVDataSet(filepath=data/01_raw/nCoV2019/ncov_outside_hubei/20200304/ncov_outside_hubei.csv, load_args={'encoding': cp1252, 'low_memory': False}, protocol=file, save_args={'index': False}).
'utf-8' codec can't decode byte 0xa0 in position 162456: invalid start byte
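
The error above is a plain decoding mismatch and can be reproduced without Kedro or the real file. A minimal sketch, assuming a made-up sample string that contains a non-breaking space (which cp1252 stores as the single byte 0xa0):

```python
# Made-up sample text; U+00A0 (non-breaking space) becomes byte 0xa0 in cp1252.
raw = "Wuhan\u00a0City".encode("cp1252")

# Decoding those bytes as UTF-8 fails, because 0xa0 cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    error_message = ""
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the encoding the bytes were written in round-trips cleanly.
decoded = raw.decode("cp1252")

print(error_message)
print(repr(decoded))
```

This suggests the file handle was already being decoded as UTF-8 before the `encoding` entry in `load_args` could take effect.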

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): 0.15.8
  • Python version used (python -V): 3.7.6
  • Operating system and version: macOS Mojave Version 10.14.6

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 15 (6 by maintainers)

Most upvoted comments

Thanks for the quick reply @mzjp2

I modified /kedro/extras/datasets/pandas/csv_dataset.py _load function by passing encoding with my specific case to the open context.

def _load(self) -> pd.DataFrame:
    load_path = get_filepath_str(self._get_load_path(), self._protocol)

    with self._fs.open(load_path, encoding="latin_1", mode="r") as fs_file:
        return pd.read_csv(fs_file, **self._load_args)

Since most of my files have the same encoding it worked, despite not being ideal.

Thanks, working as expected!

Hi @bensdm , apologies for the delay. Could you try adding the following config to your catalog entry?

fs_args:
    open_args_load:
        mode: "rb"

We use fsspec underneath and it needs to open the file in binary mode. Try passing encoding to open_args_load or load_args as well, and that should work. If no combination works please feel free to open a new issue.
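
To illustrate why binary mode matters: a handle opened in binary mode hands over raw bytes, so whichever component parses the file can apply the encoding it was given. The sketch below uses only the standard library as a stand-in. `io.BytesIO` plays the role of the binary file handle, and the hypothetical `read_rows` helper plays the role of a parser that receives an `encoding` argument; it is not how pandas or fsspec are implemented.

```python
import csv
import io

# Made-up CSV content, written as cp1252 (contains byte 0xa0).
raw = "city,cases\nWuhan\u00a0City,10\n".encode("cp1252")


def read_rows(binary_handle, encoding):
    """Hypothetical parser: decode a binary handle with the given encoding."""
    text = io.TextIOWrapper(binary_handle, encoding=encoding, newline="")
    return list(csv.reader(text))


# With the right encoding, the bytes are decoded correctly.
rows = read_rows(io.BytesIO(raw), "cp1252")

# With UTF-8, the same bytes fail as in the original report.
try:
    read_rows(io.BytesIO(raw), "utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(rows, utf8_failed)
```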

Alternatively, rather than hard-coding the encoding in _load, you can do something like

def _load(self) -> pd.DataFrame:
    load_path = get_filepath_str(self._get_load_path(), self._protocol)

    # self._fs_open_kwargs is a hypothetical attribute, set in __init__ from the catalog entry
    with self._fs.open(load_path, **self._fs_open_kwargs) as fs_file:
        return pd.read_csv(fs_file, **self._load_args)

in a similar vein to the way we handle load_args and save_args, and then pass the relevant parameters inside your catalog.yml entry. 😃

Or, if you only care about encoding, then you can make encoding one of your __init__ args and pass just encoding=self._encoding to the self._fs.open call. Hope that makes sense!
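
The encoding-only variant could look roughly like the sketch below. It is deliberately not a Kedro dataset. `EncodedCSVLoader` and `demo` are hypothetical names, the built-in `open()` stands in for `self._fs.open`, and the `csv` module stands in for `pd.read_csv`, so that the pattern can be run with the standard library alone.

```python
import csv
import os
import tempfile
from typing import Any, Dict, List, Optional


class EncodedCSVLoader:
    """Hypothetical stand-in for a custom dataset; not a Kedro class."""

    def __init__(
        self,
        filepath: str,
        encoding: str = "utf-8",
        load_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        self._filepath = filepath
        self._encoding = encoding  # stored once, reused on every load
        self._load_args = dict(load_args or {})

    def _load(self) -> List[List[str]]:
        # The encoding goes to the call that turns bytes into text,
        # while the remaining load_args go to the parser.
        with open(self._filepath, mode="r", encoding=self._encoding, newline="") as f:
            return list(csv.reader(f, **self._load_args))


def demo() -> List[List[str]]:
    """Write a small cp1252 file to a temp directory and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.csv")
        with open(path, "wb") as f:
            f.write("city;cases\nWuhan\u00a0City;10\n".encode("cp1252"))
        loader = EncodedCSVLoader(path, encoding="cp1252", load_args={"delimiter": ";"})
        return loader._load()


print(demo())
```

In a real custom dataset, the same idea would presumably mean subclassing the existing CSV dataset and overriding `_load`, as suggested above.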

I have the same error here while trying to read from S3 if my csv file is not encoded in utf-8! Any workarounds?

Hi @millengustavo and @deepyaman, thanks for raising this. We’re aware of this (and the other related issues from not being able to pass args to the fsspec open call) and it’s on our backlog to fix soon!

For now, the only workaround I can think of is creating a custom dataset from the dataset you want to use and overriding the load and save methods to pass the relevant stuff into the fsspec open call 😄