Fiona: Writing to GeoPackage is very slow

Working in GeoPandas, I found a performance issue with the GeoPackage format. It seems to be the fault of both GeoPandas and the lower-level libraries. Can performance be improved at Fiona's level?

I have an SSD. The OS is Ubuntu 17.04, Fiona version 1.7.9.post1. I installed it as recommended in #400:

sudo pip3 install -I fiona --no-binary fiona

Here’s my test suite as proof: https://github.com/culebron/geodata Run python3.6 few.py and python3.6 multiple.py to compare. few.py opens a file with a lot of data but reads only 2.7K records into a GeoDataFrame, then writes them to GeoJSON and GPKG. In this case, the GPKG driver outperforms GeoJSON. multiple.py creates a 100K-record dataframe and then saves it to GeoJSON and GPKG. Here, GPKG is incredibly slow.
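For context, here is a minimal sketch of the kind of benchmark multiple.py runs (illustrative only; the actual script is in the linked repo, and the data here is hypothetical):

import time

import geopandas as gpd
from shapely.geometry import Point

# Build a synthetic GeoDataFrame of 100K point records (hypothetical data).
n = 100000
df = gpd.GeoDataFrame({
    'value': list(range(n)),
    'geometry': [Point(i % 1000, i // 1000) for i in range(n)],
})

# Time writing the same frame with each driver.
for driver, path in [('GeoJSON', 'test.geojson'), ('GPKG', 'test.gpkg')]:
    start = time.perf_counter()
    df.to_file(path, driver=driver)
    print('writing 100K records to', driver.lower(), time.perf_counter() - start)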

My results with default GeoPandas:

$ python3.6 few.py 
writing 2.7K records to geojson 36.283805477003625
writing 2.7K records to gpkg    20.792497718997765
$ python3.6 multiple.py 
100%|████████████████████████████████████████████████████████| 100000/100000 [00:03<00:00, 29996.25it/s]
writing 100K records to geojson 61.62079200500011
writing 100K records to gpkg    260.4413645050008

And notice that in the case of multiple.py, the resulting GeoPackage file is only 9 megs, which is several times smaller than the file produced by few.py.

As I understand it, the problem is that Fiona opens a session in SQLite and creates a lock file, which takes some time. Inspecting the code, I see GeoPandas naively writes records one at a time, which means SQLite dutifully locks the database, writes, then unlocks for every single record:

https://github.com/geopandas/geopandas/blob/master/geopandas/io/file.py#L107

with fiona.drivers():
    with fiona.open(filename, 'w', driver=driver, crs=df.crs,
                    schema=schema, **kwargs) as colxn:
        for feature in df.iterfeatures():
            colxn.write(feature)

I tried batching records:

with fiona.drivers():
    with fiona.open(filename, 'w', driver=driver, crs=df.crs,
                    schema=schema, **kwargs) as colxn:
        buf = []
        for feature in df.iterfeatures():
            buf.append(feature)
            if len(buf) > 9999:
                # flush a full batch of 10,000 records at once
                colxn.writerecords(buf)
                buf = []

        # flush whatever is left in the buffer
        colxn.writerecords(buf)

$ python3.6 multiple.py 
100%|████████████████████████████████████████████████████████| 100000/100000 [00:03<00:00, 31041.45it/s]
writing 100K records to gpkg    91.03069832099573

91 seconds instead of 260. But I can still see that SQLite locks and unlocks the database many times along the way, and the actual writing speed is still slow.

Can this be improved?

You may try opening the resulting test.geojson in QGIS and saving it to GeoPackage from there; you’ll see it takes mere seconds.
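For comparison, GDAL’s own ogr2ogr command-line tool does the same conversion quickly, since it groups many inserts into a single transaction:

$ ogr2ogr -f GPKG test.gpkg test.geojson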

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 15 (13 by maintainers)

Most upvoted comments

The documentation of the SQLite driver was outdated/unclear. Fixed per https://trac.osgeo.org/gdal/changeset/40435.

@micahcochran thanks for the pointer! It seems we (speaking for the project) need to prioritize the use of OGR dataset transactions, for GPKG at least. This function and the corresponding commit function appear to be the key:

http://www.gdal.org/gdal_8h.html#a57e354633ef531d521b674b4e5321369
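For illustration, here is roughly how a batched write would look through the GDAL/OGR Python bindings directly (a sketch, not Fiona’s internals; it assumes GDAL >= 2.0, where dataset-level transactions were introduced, and uses hypothetical file and layer names):

from osgeo import ogr, osr

# Create a GPKG datasource and a point layer (hypothetical names).
drv = ogr.GetDriverByName('GPKG')
ds = drv.CreateDataSource('out.gpkg')
srs = osr.SpatialReference()
srs.ImportFromEPSG(4326)
layer = ds.CreateLayer('features', srs, ogr.wkbPoint)

# One dataset-level transaction around the whole batch, so SQLite
# locks once, writes everything, and unlocks once at commit.
ds.StartTransaction()
defn = layer.GetLayerDefn()
for x, y in [(0.0, 0.0), (1.0, 1.0)]:  # stand-in for 100K records
    feat = ogr.Feature(defn)
    feat.SetGeometry(ogr.CreateGeometryFromWkt('POINT (%f %f)' % (x, y)))
    layer.CreateFeature(feat)
    feat = None
ds.CommitTransaction()
ds = None  # flush and close the datasource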