kedro: DataCatalog can't read in shapefile
Description
I can’t read in shapefiles (either a shapefile folder or a .shp file) when I specify them in conf/base/catalog.yml. If I import GeoJSONDataSet directly, it works fine in a Python script. Is it preferable to update the current implementation of GeoJSONDataSet or to create a new extra dataset class for shapefiles?
Context
I have a shapefile folder containing the .shp, .shx, .dbf, etc. files. I can pass the folder path to geopandas or GeoJSONDataSet to read it in as a GeoDataFrame, but when I specify the path in the data catalog and run the node, it gives me an error.
Steps to Reproduce
- This is what works:
from kedro.extras.datasets.geopandas import GeoJSONDataSet
data = GeoJSONDataSet("sample_shape_file")
data
<kedro.extras.datasets.geopandas.geojson_dataset.GeoJSONDataSet at 0x1323e0dd0>
data_2 = GeoJSONDataSet("sample_shape_file/sample_shape_file.shp")
data_2
<kedro.extras.datasets.geopandas.geojson_dataset.GeoJSONDataSet at 0x1323f0950>
- Here’s what I have in my catalog.yml
shape_file_data:
  type: geopandas.GeoJSONDataSet
  filepath: data/01_raw/sample_shape_file
  save_args:
    driver: GeoJSON
- Here’s the node in my pipeline:
node(
    func=do_something,
    inputs=[
        "shape_file_data",
        "params:location",
    ],
    outputs="filtered_shape_file_data",
    name="do_something",
),
- When I run kedro run --node=do_something, I get this error:
2021-02-16 14:24:24,394 - root - INFO - ** Kedro project project_name
2021-02-16 14:24:24,434 - kedro.io.data_catalog - INFO - Loading data from `shape_file_data` (GeoJSONDataSet)...
2021-02-16 14:24:24,434 - kedro.runner.sequential_runner - WARNING - There are 1 nodes that have not run.
You can resume the pipeline run by adding the following argument to your previous command:
2021-02-16 14:24:24,442 - kedro.framework.session.store - INFO - `save()` not implemented for `BaseSessionStore`. Skipping the step.
Traceback (most recent call last):
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/io/core.py", line 208, in load
return self._load()
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/extras/datasets/geopandas/geojson_dataset.py", line 149, in _load
with self._fs.open(load_path, **self._fs_open_args_load) as fs_file:
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/fsspec/spec.py", line 936, in open
**kwargs
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/fsspec/implementations/local.py", line 117, in _open
return LocalFileOpener(path, mode, fs=self, **kwargs)
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/fsspec/implementations/local.py", line 199, in __init__
self._open()
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/fsspec/implementations/local.py", line 204, in _open
self.f = open(self.path, mode=self.mode)
IsADirectoryError: [Errno 21] Is a directory: '/Users/duongvu/project_name/data/01_raw/shape_file_data'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/duongvu/.pyenv/versions/env_name/bin/kedro", line 10, in <module>
sys.exit(main())
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/framework/cli/cli.py", line 696, in main
cli_collection(**cli_context)
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/Users/duongvu/project_name/cli.py", line 212, in run
pipeline_name=pipeline,
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/framework/session/session.py", line 414, in run
run_result = runner.run(filtered_pipeline, catalog, run_id)
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/runner/runner.py", line 100, in run
self._run(pipeline, catalog, run_id)
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/runner/sequential_runner.py", line 90, in _run
run_node(node, catalog, self._is_async, run_id)
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/runner/runner.py", line 212, in run_node
node = _run_node_sequential(node, catalog, run_id)
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/runner/runner.py", line 285, in _run_node_sequential
inputs[name] = catalog.load(name)
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/io/data_catalog.py", line 402, in load
result = func()
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/io/core.py", line 611, in load
return super().load()
File "/Users/duongvu/.pyenv/versions/3.7.7/envs/env_name/lib/python3.7/site-packages/kedro/io/core.py", line 217, in load
raise DataSetError(message) from exc
kedro.io.core.DataSetError: Failed while loading data from data set GeoJSONDataSet(filepath=/Users/duongvu/project_name/data/01_raw/shape_file_data, load_args={}, protocol=file, save_args={'driver': GeoJSON}).
[Errno 21] Is a directory: '/Users/duongvu/project_name/data/01_raw/shape_file_data'
- If I specify the .shp file inside like this:
shape_file_data:
  type: geopandas.GeoJSONDataSet
  filepath: data/01_raw/sample_shape_file/sample_shape_file.shp
  save_args:
    driver: GeoJSON
I get a different error:
kedro.io.core.DataSetError: Failed while loading data from data set GeoJSONDataSet(...) not recognized as a supported file format.
- I tried different save_args drivers, from “ESRI Shapefile” to None to “GeoJSON”. None of them works.
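A note on the two cases above: the script that "works" only constructs the dataset object (the output is the object's repr, not a dataframe), while the failing run actually calls load. The traceback shows the load path being opened as one file, which fails for a directory. The sketch below reproduces that difference with the standard library only. `LazyFileDataSet` and `try_load` are hypothetical stand-ins written for illustration, not kedro classes, and the .shp file is a placeholder rather than a real shapefile.

```python
import tempfile
from pathlib import Path


class LazyFileDataSet:
    """Hypothetical stand-in: keeps the path at construction, opens it only in load()."""

    def __init__(self, filepath):
        self._filepath = Path(filepath)  # no filesystem access happens here

    def load(self):
        # Opens the path as ONE file, mirroring the open() call at the bottom
        # of the traceback.
        with open(self._filepath, "rb") as f:
            return f.read()


def try_load(path):
    """Return the loaded bytes, or the name of the OSError subclass raised by load()."""
    dataset = LazyFileDataSet(path)  # succeeds even when path is a directory
    try:
        return dataset.load()
    except OSError as exc:
        return type(exc).__name__


with tempfile.TemporaryDirectory() as tmp:
    shape_dir = Path(tmp) / "sample_shape_file"
    shape_dir.mkdir()
    shp = shape_dir / "sample_shape_file.shp"
    shp.write_bytes(b"placeholder")  # placeholder content, not a real shapefile

    dir_result = try_load(shape_dir)  # IsADirectoryError on Linux/macOS
    file_result = try_load(shp)       # a single file opens without error

print(dir_result, file_result)
```

So constructing GeoJSONDataSet with a folder path proves nothing until load() is called, and a folder path cannot succeed with a loader that opens exactly one file.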
My Environment
- Kedro version used: 0.17.0
- Python version used: 3.7.7
- Operating system and version: MacOS
About this issue
- State: closed
- Created 3 years ago
- Comments: 15 (8 by maintainers)
Commits related to this issue
- [KED-1819] Add support for wheel version option for `kedro pipeline package` (#695) — committed to vishalbelsare/kedro by lorenabalan 4 years ago
From Geopandas documentation (link) there should be an easy solution:
.shp file; .shp, .shx, .dbf, .prj, etc.; .shp file path. Their example:
~I have not tested this with the catalog yet, but it should work; using this snippet with one of my files works.~
It works.
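Because the dataset opens exactly one file, one way to apply this is to pack the .shp file and its companion files into a single archive first. This is a minimal sketch using only the standard library. `bundle_shapefile` is a hypothetical helper, not a kedro or GeoPandas function, and pointing the catalog entry at the resulting archive is an assumption to verify against your own versions.

```python
import tempfile
import zipfile
from pathlib import Path


def bundle_shapefile(shp_path, zip_path):
    """Zip a .shp file together with every sibling file that shares its base name."""
    shp_path = Path(shp_path)
    zip_path = Path(zip_path)
    prefix = shp_path.stem + "."
    members = sorted(
        p
        for p in shp_path.parent.iterdir()
        if p.is_file() and p.name.startswith(prefix) and p.resolve() != zip_path.resolve()
    )
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for member in members:
            # Store members flat, without the parent folder in the archive path.
            archive.write(member, arcname=member.name)
    return [m.name for m in members]


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "sample_shape_file"
    folder.mkdir()
    for ext in (".shp", ".shx", ".dbf", ".prj"):
        # Empty placeholders, not a real shapefile.
        (folder / f"sample_shape_file{ext}").write_bytes(b"")

    archive_path = Path(tmp) / "sample_shape_file.zip"
    bundled = bundle_shapefile(folder / "sample_shape_file.shp", archive_path)
    with zipfile.ZipFile(archive_path) as archive:
        archive_names = sorted(archive.namelist())

# GeoPandas' read_file documentation describes reading zipped shapefiles via a
# "zip://" path prefix; whether the kedro dataset accepts the archive path is
# an assumption to check.
print(bundled)
```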