kedro: [KED-1473] `pandas.CSVDataSet` doesn't support `encoding` parameter
Description
I’m unable to load a non-utf-8-encoded file.
I know it doesn’t work because of https://github.com/quantumblacklabs/kedro/blob/f03226e29b8a018a0f6edab6d3f1a0d37c1b1812/kedro/extras/datasets/pandas/csv_dataset.py#L154-L155. For it to work, encoding would have to be passed to open. However, some file systems don’t support an encoding parameter… (e.g. gcsfs, I think).
Steps to Reproduce
Try loading https://github.com/beoutbreakprepared/nCoV2019/blob/433628fb828f3b3b3bff7d13195af357fe42e31d/ncov_outside_hubei.csv as a CSVDataSet.
Expected Result
I can load a cp1252-encoded file directly with pandas:
pd.read_csv("data/01_raw/nCoV2019/ncov_outside_hubei/20200304/ncov_outside_hubei.csv", encoding="cp1252")
Actual Result
I’m unable to load a cp1252-encoded file using Kedro:
DataSetError: Failed while loading data from data set CSVDataSet(filepath=data/01_raw/nCoV2019/ncov_outside_hubei/20200304/ncov_outside_hubei.csv, load_args={'encoding': cp1252, 'low_memory': False}, protocol=file, save_args={'index': False}).
'utf-8' codec can't decode byte 0xa0 in position 162456: invalid start byte
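The error above is consistent with the file handle being opened in text mode with a UTF-8 decoder before the parser ever sees it. A minimal standard-library sketch of that behaviour follows, assuming a small made-up sample file (the `city,cases` content is hypothetical and not taken from the linked dataset):

```python
import os
import tempfile

# Hypothetical sample row. Byte 0xA0 is a non-breaking space in cp1252,
# but it is not a valid start byte in UTF-8.
raw = b"city,cases\nS\xa0Paulo,1\n"

with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # The decoder is fixed when the handle is opened in text mode, so
    # whatever consumes the handle afterwards receives already-decoded text.
    try:
        with open(path, mode="r", encoding="utf-8") as handle:
            handle.read()
        utf8_error = None
    except UnicodeDecodeError as exc:
        utf8_error = exc

    # Passing the encoding to open() itself is what lets the read succeed.
    with open(path, mode="r", encoding="cp1252") as handle:
        text = handle.read()
finally:
    os.remove(path)

print(utf8_error)
print(repr(text))
```

This only illustrates the decoding step; it does not exercise Kedro or pandas.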
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
- Kedro version used (`pip show kedro` or `kedro -V`): 0.15.8
- Python version used (`python -V`): 3.7.6
- Operating system and version: macOS Mojave Version 10.14.6
About this issue
- State: closed
- Created 4 years ago
- Comments: 15 (6 by maintainers)
Thanks for the quick reply @mzjp2
I modified the `_load` function in `/kedro/extras/datasets/pandas/csv_dataset.py` by passing the encoding for my specific case to the `open` context. Since most of my files have the same encoding, it worked, despite not being ideal.
Thanks, working as expected!
Hi @bensdm, apologies for the delay. Could you try adding the following config to your catalog entry?
We use `fsspec` underneath and it needs to open the file in binary mode. Try passing `encoding` to `open_args_load` or `load_args` as well, and that should work. If no combination works, please feel free to open a new issue.
Alternatively, you can do something like
in a similar vein to the way we do `load_args` or `save_args`, then pass the relevant parameters inside your `catalog.yml` entry. 😃
Or, if you only care about encoding, then you can make `encoding` one of your `__init__` args and pass just `encoding=self._encoding` to the `self._fs.open` call. Hope that makes sense!
Hi @millengustavo and @deepyaman, thanks for raising this. We’re aware of this (and the other related issues from not being able to pass args to the fsspec open call) and it’s on our backlog to fix soon!
For now, the only workaround I can think of is creating a custom dataset from the dataset you want to use and overriding the load and save methods to pass the relevant stuff into the fsspec open call 😄
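As a rough illustration of the custom-dataset workaround, here is a standard-library sketch of the pattern: take `encoding` as an `__init__` argument and hand it to the open call inside `_load`. The class name `EncodedCSVDataSet` and the sample data are hypothetical, and the class is a stand-in rather than Kedro's actual API; in a real project you would presumably subclass the Kedro dataset and keep its `self._fs.open` and `pd.read_csv` calls.

```python
import csv
import os
import tempfile
from typing import Any, Dict, List, Optional


class EncodedCSVDataSet:
    """Hypothetical stand-in for a custom dataset; not Kedro's actual class."""

    def __init__(
        self,
        filepath: str,
        encoding: str = "utf-8",
        load_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        self._filepath = filepath
        self._encoding = encoding
        self._load_args = load_args or {}

    def _load(self) -> List[Dict[str, str]]:
        # Built-in open() stands in for self._fs.open(); the key point is
        # that the encoding goes to the open call, not to the parser.
        with open(
            self._filepath, mode="r", encoding=self._encoding, newline=""
        ) as handle:
            # csv.DictReader stands in for pd.read_csv(handle, **load_args).
            return list(csv.DictReader(handle, **self._load_args))


# Write a small cp1252-encoded file (made-up content) and load it back.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write("city,cases\nS\u00a0Paulo,1\n".encode("cp1252"))
    path = tmp.name

try:
    rows = EncodedCSVDataSet(path, encoding="cp1252")._load()
finally:
    os.remove(path)

print(rows)
```

Keeping `encoding` separate from `load_args` makes it clear which setting is consumed by the file-system layer and which by the parser.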