geopandas: [cython] specific case where new sjoin is much slower

@andreas-h reported a use case where the sjoin from the geopandas-cython branch is much slower than the current released version: https://gist.github.com/andreas-h/4906aea5d8ecffc9751e191cd11d00b4

I ran it locally and can confirm this. The example joins 20,000 points with 44,000 polygons, which takes only about 5 s on master but 30-60 s on the cython branch.

I tried to profile it, but the profile indicates that virtually all of the time is spent within the cython cysjoin function (and thus in the C sjoin function). That is strange, because the actual pandas code in the user-facing sjoin function should also take some time. I have not yet checked that both versions produce the same results; possibly one of the two implementations is doing something wrong.
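For anyone who wants to repeat the profiling step, here is a minimal sketch using only the standard library. `naive_join` and `profile_call` are hypothetical stand-ins written for this illustration, not geopandas functions; in practice you would pass the real sjoin call to `profile_call`. Note that cProfile reports a compiled (C/cython) function as a single opaque entry, which would be consistent with all the time being attributed to one function.

```python
import cProfile
import io
import pstats
import random


def naive_join(points, boxes):
    """Hypothetical stand-in for sjoin: brute-force point-in-box matching."""
    pairs = []
    for i, (x, y) in enumerate(points):
        for j, (xmin, ymin, xmax, ymax) in enumerate(boxes):
            if xmin <= x <= xmax and ymin <= y <= ymax:
                pairs.append((i, j))
    return pairs


def profile_call(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and keep the ten most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(200)]
# 10 x 10 grid of square tiles (side 0.1) covering [0, 1] x [0, 1]
boxes = [(i / 10, j / 10, (i + 1) / 10, (j + 1) / 10)
         for i in range(10) for j in range(10)]

pairs, report = profile_call(naive_join, points, boxes)
print(report)
```

The report lists one row per Python-level function, so any time spent in the pandas part of a user-facing function would normally show up as separate rows.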

cc @mrocklin

@andreas-h could you simplify the example a little bit, so that it does not depend on the emiprepr library (e.g. by constructing the polygons directly inside the notebook)?
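A self-contained reproduction along those lines might look like the sketch below. This is not the data from the gist: the regular grid of square cells, the random points, the column names and the (much smaller) sizes are all assumptions made for illustration, and would need to be scaled up to approach the 20,000 x 44,000 case.

```python
import time

import geopandas as gpd
import numpy as np
from shapely.geometry import Point, box

rng = np.random.default_rng(0)

n_side = 50        # assumed grid size: 50 x 50 = 2,500 square cells
n_points = 1_000   # assumed number of random points

# Polygons: unit squares tiling [0, n_side] x [0, n_side]
polys = gpd.GeoDataFrame(
    {"cell_id": np.arange(n_side * n_side)},
    geometry=[box(i, j, i + 1, j + 1)
              for i in range(n_side) for j in range(n_side)],
)

# Points: uniformly random locations inside the grid
xy = rng.uniform(0, n_side, size=(n_points, 2))
points = gpd.GeoDataFrame(
    {"point_id": np.arange(n_points)},
    geometry=[Point(x, y) for x, y in xy],
)

start = time.perf_counter()
# Default predicate (intersects); a point on a shared edge matches several cells
joined = gpd.sjoin(points, polys, how="inner")
elapsed = time.perf_counter() - start

print(f"{len(joined)} matches in {elapsed:.3f} s")
```

Running the same script on both branches and comparing `joined` would also answer the open question of whether the two implementations return the same result.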

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 18 (17 by maintainers)

Most upvoted comments

I re-ran these tests (gist). I’m posting here as well as in #1344 to try to give some closure to this issue.

Namely, I added PyGEOS, which also uses GEOS’ STRTree but with a different Python binding and different geometry data structures: build query

So it seems to me that most of the slowdown comes from the Shapely/Python layer, not from GEOS.