kedro: DataCatalog can't read in shapefile

Description

I can't read in shapefiles (either a shapefile folder or a .shp file) when I specify them in conf/base/catalog.yml. If I import GeoJSONDataSet, it works fine in a Python script. Is it preferable to update the current implementation of GeoJSONDataSet or to create a new extra dataset class for shapefiles?

Context

I have a shapefile folder containing .shp, .shx, .dbf, etc. files. I can pass the folder path to geopandas or GeoJSONDataSet to read it in as a GeoDataFrame, but when I specify the path in the data catalog and run the node, I get an error.

Steps to Reproduce

  1. This is what works:
from kedro.extras.datasets.geopandas import GeoJSONDataSet
data = GeoJSONDataSet("sample_shape_file")
data
<kedro.extras.datasets.geopandas.geojson_dataset.GeoJSONDataSet at 0x1323e0dd0>

data_2 = GeoJSONDataSet("sample_shape_file/sample_shape_file.shp")
data_2
<kedro.extras.datasets.geopandas.geojson_dataset.GeoJSONDataSet at 0x1323f0950>
  2. Here's what I have in my catalog.yml:
shape_file_data:
  type: geopandas.GeoJSONDataSet
  filepath: data/01_raw/sample_shape_file
  save_args:
    driver: GeoJSON
  3. Here's the node in my pipeline:
node(
    func=do_something,
    inputs=[
        "shape_file_data",
        "params:location",
    ],
    outputs="filtered_shape_file_data",
    name="do_something",
),
  4. When I run kedro run --node=do_something, I get this error:
2021-02-16 14:24:24,394 - root - INFO - ** Kedro project project_name
2021-02-16 14:24:24,434 - kedro.io.data_catalog - INFO - Loading data from `shape_file_data` (GeoJSONDataSet)...
2021-02-16 14:24:24,434 - kedro.runner.sequential_runner - WARNING - There are 1 nodes that have not run.
You can resume the pipeline run by adding the following argument to your previous command:

2021-02-16 14:24:24,442 - kedro.framework.session.store - INFO - `save()` not implemented for `BaseSessionStore`. Skipping the step.
Traceback (most recent call last):
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/io/core.py", line 208, in load
    return self._load()
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/extras/datasets/geopandas/geojson_dataset.py", line 149, in _load
    with self._fs.open(load_path, **self._fs_open_args_load) as fs_file:
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/fsspec/spec.py", line 936, in open
    **kwargs
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/fsspec/implementations/local.py", line 117, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/fsspec/implementations/local.py", line 199, in __init__
    self._open()
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/fsspec/implementations/local.py", line 204, in _open
    self.f = open(self.path, mode=self.mode)
IsADirectoryError: [Errno 21] Is a directory: '/Users/duongvu/project_name/data/01_raw/shape_file_data'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/duongvu/.pyenv/versions/env_name/bin/kedro", line 10, in <module>
    sys.exit(main())
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/framework/cli/cli.py", line 696, in main
    cli_collection(**cli_context)
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/duongvu/project_name/cli.py", line 212, in run
    pipeline_name=pipeline,
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/framework/session/session.py", line 414, in run
    run_result = runner.run(filtered_pipeline, catalog, run_id)
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/runner/runner.py", line 100, in run
    self._run(pipeline, catalog, run_id)
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/runner/sequential_runner.py", line 90, in _run
    run_node(node, catalog, self._is_async, run_id)
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/runner/runner.py", line 212, in run_node
    node = _run_node_sequential(node, catalog, run_id)
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/runner/runner.py", line 285, in _run_node_sequential
    inputs[name] = catalog.load(name)
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/io/data_catalog.py", line 402, in load
    result = func()
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/io/core.py", line 611, in load
    return super().load()
  File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/io/core.py", line 217, in load
    raise DataSetError(message) from exc
kedro.io.core.DataSetError: Failed while loading data from data set GeoJSONDataSet(filepath=/Users/duongvu/project_name/data/01_raw/shape_file_data, load_args={}, protocol=file, save_args={'driver': GeoJSON}).
[Errno 21] Is a directory: '/Users/duongvu/project_name/data/01_raw/shape_file_data'
  5. If I specify the .shp file inside the folder like this:
shape_file_data:
  type: geopandas.GeoJSONDataSet
  filepath: data/01_raw/sample_shape_file/sample_shape_file.shp
  save_args:
    driver: GeoJSON

I get a different error:

kedro.io.core.DataSetError: Failed while loading data from data set GeoJSONDataSet(...) not recognized as a supported file format.
  6. I tried different save_args drivers, from "ESRI Shapefile" to None to "GeoJSON". None of them works.
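
The first traceback above ends in a plain `open(self.path, mode=self.mode)` call inside fsspec's local filesystem, which would explain why a folder path fails with `IsADirectoryError`. It may also explain why the snippet under "This is what works" appears to succeed: those lines only show the dataset object being constructed, while the traceback shows the file being opened later, in `_load`. Below is a minimal sketch of that failure mode using only the standard library. The helper name `try_open_as_file` and the temporary paths are hypothetical and not part of Kedro or fsspec.

```python
import tempfile
from pathlib import Path


def try_open_as_file(path):
    """Attempt the same kind of call as the last traceback frame: open(path, mode)."""
    try:
        with open(path, mode="rb"):
            return "opened"
    except OSError as exc:
        # On POSIX systems a directory raises IsADirectoryError ([Errno 21]);
        # other platforms may raise a different OSError subclass.
        return type(exc).__name__


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for the shapefile folder from the issue (hypothetical contents).
        folder = Path(tmp) / "sample_shape_file"
        folder.mkdir()
        shp = folder / "sample_shape_file.shp"
        shp.write_bytes(b"")  # empty placeholder, not a real shapefile

        print(try_open_as_file(folder))  # the folder path cannot be opened as a file
        print(try_open_as_file(shp))     # a single file path can be opened
```

Opening the .shp file succeeds at this level, so the "not recognized as a supported file format" error in step 5 presumably arises later, when a single open file handle is handed to the reader without its sidecar files.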

My Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used: 0.17.0
  • Python version used: 3.7.7
  • Operating system and version: macOS

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 15 (8 by maintainers)

Most upvoted comments

According to the GeoPandas documentation, there should be an easy solution:

  • Instead of passing the path to the .shp file:
  • Zip all the files together (inside a folder): .shp, .shx, .dbf, .prj, etc.
  • Use the zip path instead of the .shp file path.
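
The zipping step above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function name `zip_shapefile` is hypothetical, it treats every file in the same folder whose name starts with the .shp file's stem as a component, and it stores the files flat at the archive root rather than inside a subfolder.

```python
import zipfile
from pathlib import Path


def zip_shapefile(shp_path, zip_path):
    """Bundle a .shp file and its sidecar files (.shx, .dbf, .prj, ...) into one zip.

    Returns the path of the archive that was written.
    """
    shp_path = Path(shp_path)
    zip_path = Path(zip_path)
    prefix = shp_path.stem + "."

    # Collect sibling files that share the shapefile's stem, skipping any
    # existing .zip so the archive never tries to include itself.
    components = sorted(
        p for p in shp_path.parent.iterdir()
        if p.is_file() and p.name.startswith(prefix) and p.suffix.lower() != ".zip"
    )

    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for component in components:
            # Assumption of this sketch: files are stored flat at the archive root.
            archive.write(component, arcname=component.name)
    return zip_path
```

For the folder in this issue, a call might look like `zip_shapefile("data/01_raw/sample_shape_file/sample_shape_file.shp", "data/01_raw/sample_shape_file.zip")`, and the resulting .zip path would then be the one to try as the dataset's filepath.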

Their example:

import fsspec
import geopandas

path = "simplecache::http://download.geofabrik.de/antarctica-latest-free.shp.zip"
with fsspec.open(path) as file:
    df = geopandas.read_file(file)

~I have not tested this with the catalog yet, but it should work; using this snippet with one of my files works.~

It works.