Fiona: Writing to GeoPackage is very slow
Working in GeoPandas, I ran into a performance issue with the GeoPackage format. It seems to be partly GeoPandas' fault and partly a lower-level one. Can performance be improved at Fiona's level?
I have an SSD, the OS is Ubuntu 17.04, and the Fiona version is 1.7.9.post1. I installed it as recommended in #400:
sudo pip3 install -I fiona --no-binary fiona
Here's my test suite as proof: https://github.com/culebron/geodata. Run python3.6 few.py and python3.6 multiple.py to compare. few.py opens a file with a lot of data but only 2.7K records as a GeoDataFrame and writes them to GeoJSON and GPKG; in this case the GPKG driver outperforms GeoJSON. multiple.py creates a 100K-record dataframe and saves it to GeoJSON and GPKG; here GPKG is incredibly slow.
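For context, here is a minimal sketch of the kind of benchmark multiple.py runs. The actual script is in the repo linked above; the column name, coordinates, and output file names below are illustrative rather than copied from it, and the crs string form assumes a recent GeoPandas.

import time
import geopandas as gpd
from shapely.geometry import Point

# Build a synthetic 100K-point GeoDataFrame.
n = 100000
df = gpd.GeoDataFrame(
    {'id': range(n)},
    geometry=[Point(i % 360 - 180, i % 170 - 85) for i in range(n)],
    crs='EPSG:4326',
)

# Time the same dataframe going to GeoJSON and to GeoPackage.
t0 = time.perf_counter()
df.to_file('test.geojson', driver='GeoJSON')
print('writing 100K records to geojson', time.perf_counter() - t0)

t0 = time.perf_counter()
df.to_file('test.gpkg', driver='GPKG')
print('writing 100K records to gpkg', time.perf_counter() - t0)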
My results with default GeoPandas:
$ python3.6 few.py
writing 2.7K records to geojson 36.283805477003625
writing 2.7K records to gpkg 20.792497718997765
$ python3.6 multiple.py
100%|████████████████████████████████████████████████████████| 100000/100000 [00:03<00:00, 29996.25it/s]
writing 100K records to geojson 61.62079200500011
writing 100K records to gpkg 260.4413645050008
Also note that in the case of multiple.py, the resulting GeoPackage file is only 9 MB, which is several times smaller than the file produced by few.py.
As I understand it, the problem is that Fiona opens an SQLite session and creates a lock file, which takes some time. And inspecting the code, I see that GeoPandas naively writes everything one record at a time, which means SQLite dutifully locks the database, writes, then unlocks for every single record:
https://github.com/geopandas/geopandas/blob/master/geopandas/io/file.py#L107
with fiona.drivers():
    with fiona.open(filename, 'w', driver=driver, crs=df.crs,
                    schema=schema, **kwargs) as colxn:
        for feature in df.iterfeatures():
            colxn.write(feature)
I tried batching records:
with fiona.drivers():
    with fiona.open(filename, 'w', driver=driver, crs=df.crs,
                    schema=schema, **kwargs) as colxn:
        buf = []
        for feature in df.iterfeatures():
            buf.append(feature)
            if len(buf) > 9999:
                colxn.writerecords(buf)
                buf = []
        colxn.writerecords(buf)
$ python3.6 multiple.py
100%|████████████████████████████████████████████████████████| 100000/100000 [00:03<00:00, 31041.45it/s]
writing 100K records to gpkg 91.03069832099573
91 seconds instead of 260. But SQLite still locks and unlocks the database many times along the way, and the actual writing speed is still slow.
Can this be improved?
You can try opening the resulting test.geojson in QGIS and saving it to GeoPackage from there; you'll see it takes mere seconds.
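QGIS writes vector files through GDAL/OGR, which presumably groups the inserts into transactions rather than committing per feature. For comparison, the equivalent conversion with GDAL's command-line tool, assuming ogr2ogr is installed (the output file name here is made up), would be:

ogr2ogr -f GPKG test_from_ogr.gpkg test.geojson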
The documentation of the SQLite driver was outdated/unclear. Fixed per https://trac.osgeo.org/gdal/changeset/40435.
@micahcochran thanks for the pointer! It seems like we (speaking for the project) need to prioritize using OGR dataset transactions, at least for GPKG. This function and the corresponding commit function appear to be the keys:
http://www.gdal.org/gdal_8h.html#a57e354633ef531d521b674b4e5321369
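For reference, here is a minimal sketch of what those dataset transactions look like from the GDAL/OGR Python bindings. This is not Fiona's code; the file path, layer name, and field name are invented for illustration. The point is that all CreateFeature calls happen inside a single StartTransaction/CommitTransaction pair (the Python-level equivalents of the C start/commit functions referenced above), so SQLite commits once per batch instead of once per feature.

from osgeo import ogr, osr

# Create a GeoPackage with one point layer (illustrative path and names).
driver = ogr.GetDriverByName('GPKG')
ds = driver.CreateDataSource('/tmp/transaction_demo.gpkg')

srs = osr.SpatialReference()
srs.ImportFromEPSG(4326)
layer = ds.CreateLayer('points', srs, ogr.wkbPoint)
layer.CreateField(ogr.FieldDefn('id', ogr.OFTInteger))

ds.StartTransaction()                 # one SQLite transaction for the whole batch
for i in range(100000):
    feature = ogr.Feature(layer.GetLayerDefn())
    feature.SetField('id', i)
    point = ogr.Geometry(ogr.wkbPoint)
    point.AddPoint(float(i % 360 - 180), 0.0)
    feature.SetGeometry(point)
    layer.CreateFeature(feature)
    feature = None                    # release the feature
ds.CommitTransaction()                # a single commit instead of one per feature

ds = None                             # close and flush the data source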