tablesaw: Joins are too slow
Hi guys!
I’m trying to migrate from Python + pandas to Kotlin + tablesaw.
Some parts of my code are already fast (CSV parsing, for example, is about 2× faster than in pandas),
but I’ve noticed that the inner join operation is quite slow (roughly 2× slower than in pandas).
I then tried to optimize my code by using an isIn Selection instead of a plain join. Unfortunately, it calls strings.toArray(new String[0]) under the hood on the input collection, so each lookup is a linear scan. It would make more sense to use a HashSet for faster lookups. So I wrote my own predicate:
val customersSet = customers.toHashSet() // for faster lookup
val idColumn = transactions.stringColumn("customer_id")
idColumn.eval { it in customersSet }
This is about 15× faster than the original inner join, at least on my large dataset. Of course, it’s much simpler than a full join, since I’m not appending columns etc., but the difference is still huge.
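To illustrate why the HashSet predicate wins, here is a minimal self-contained sketch in plain Java (no tablesaw; the data and names are hypothetical stand-ins for the customer_id column and the join keys). It contrasts a per-row linear scan over an array, which is effectively what a toArray-backed isIn does, with a HashSet membership test:

```java
import java.util.*;

public class MembershipFilter {
    // What an isIn-style filter effectively does after toArray(new String[0]):
    // a linear scan over the key array for every row, O(rows * keys).
    static long countLinear(List<String> ids, String[] keys) {
        return ids.stream().filter(id -> Arrays.asList(keys).contains(id)).count();
    }

    // The HashSet variant: O(1) expected lookup per row, O(rows + keys) total.
    static long countHashed(List<String> ids, Set<String> keys) {
        return ids.stream().filter(keys::contains).count();
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for the customer_id column and the join keys.
        List<String> transactions = List.of("c1", "c9", "c2", "c2", "c7", "c3");
        String[] customers = {"c1", "c2", "c3"};
        System.out.println(countLinear(transactions, customers));         // 4
        System.out.println(countHashed(transactions, Set.of(customers))); // 4
    }
}
```

Both versions select the same rows; only the lookup cost per row differs, which is where the large speedup on a big dataset comes from.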
I haven’t investigated the join code yet, but I hope there’s room for improvement there.
My key point is: Kotlin + JVM should be at least as fast as Python + pandas.
What do you guys think?
PS: do you use hash indexing on table columns?
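On the hash-indexing question, the idea can be sketched as a map from each column value to the row positions where it occurs. This is purely illustrative (buildIndex is a hypothetical name, and this is not tablesaw’s actual implementation):

```java
import java.util.*;

public class HashIndexSketch {
    // Build a value -> row-positions index over a column in one pass.
    static Map<String, List<Integer>> buildIndex(List<String> column) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int row = 0; row < column.size(); row++) {
            index.computeIfAbsent(column.get(row), k -> new ArrayList<>()).add(row);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> customerId = List.of("c1", "c2", "c1", "c3");
        Map<String, List<Integer>> index = buildIndex(customerId);

        // An equality filter becomes a single O(1) expected-time probe
        // instead of a scan over the whole column.
        System.out.println(index.get("c1")); // [0, 2]
    }
}
```

With such an index, each join-key lookup is one hash probe rather than a column scan.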
About this issue
- State: closed
- Created 6 years ago
- Comments: 17
I didn’t get this finished. It’s more complex than it first seemed. These notes are here for the next attempt.
There were two things contributing to slow joins.
The second issue accounts for most of the time in the join.
I put through a fix for the first issue: pre-creating the indexes so they’re built only once per join. Performance is now independent of which table comes first.
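The "build the index once per join" fix amounts to a classic hash join: construct the index over one table a single time, then probe it for every row of the other table. A rough sketch with plain Java collections (innerJoin and the row layout are hypothetical, not tablesaw internals):

```java
import java.util.*;

public class HashJoinSketch {
    // Inner join two "tables" (lists of {key, value} rows) on the key,
    // building the hash index over the right table exactly once.
    static List<String> innerJoin(List<String[]> left, List<String[]> right) {
        Map<String, List<String[]>> index = new HashMap<>();
        for (String[] row : right) {
            index.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }
        List<String> result = new ArrayList<>();
        for (String[] l : left) {
            // One hash probe per left row; the index is never rebuilt.
            for (String[] r : index.getOrDefault(l[0], List.of())) {
                result.add(l[0] + "," + l[1] + "," + r[1]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> orders = List.of(
                new String[]{"c1", "order-1"},
                new String[]{"c2", "order-2"},
                new String[]{"c1", "order-3"});
        List<String[]> customers = List.of(
                new String[]{"c1", "Alice"},
                new String[]{"c2", "Bob"});

        innerJoin(orders, customers).forEach(System.out::println);
    }
}
```

Because the index is built only once, the cost is roughly linear in the two table sizes regardless of which table is probed, which is why the fixed version no longer depends on table order.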
While it sounds easy to re-swap the tables in the join result, it’s difficult to make the result come out the same as it would have if the tables were handled in the order specified. The row order in the result is different, but I think that is acceptable. It does seem useful, however, to keep the column ordering and naming the same. Achieving that is made difficult by interactions with other already-implemented features:
Some of these are easier to deal with than others, but together they make this a non-trivial fix.
A left join with a schema similar to the one laid out above by @deviant-studio ddl19901201 now runs in about 300ms with PR #562
I would like to work on improving join performance.
I ran some performance tests with the following schema:
I found one big culprit, especially when the large table is on the left side of the join.
Under the covers, table2.where copies all the columns in the table. This gets very slow with large numbers of columns. All the columns are then copied a second time while calculating the cross product.
In one test where the orders table (the large table) was on the left side of the join and the customers table had six columns, the join took about 10 seconds to complete, of which about 8 seconds was spent in
Table table2Rows = table2.where(rowBitMapMultiCol);. I am going to work on a PR that will.
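One way to avoid that double copy, in principle, is to carry only row indices (a selection) through the join and materialize column data once at the very end. A rough sketch of the idea with plain arrays, not tablesaw’s actual Selection API (select and take are hypothetical names):

```java
import java.util.*;

public class SelectionSketch {
    // Step 1: compute only the matching row indices; no column data is
    // copied here, unlike eagerly materializing table2.where(...) as a Table.
    static List<Integer> select(String[] key, Set<String> wanted) {
        List<Integer> selection = new ArrayList<>();
        for (int row = 0; row < key.length; row++) {
            if (wanted.contains(key[row])) selection.add(row);
        }
        return selection;
    }

    // Step 2: materialize a column exactly once, at the very end.
    static int[] take(int[] column, List<Integer> selection) {
        int[] out = new int[selection.size()];
        for (int i = 0; i < selection.size(); i++) {
            out[i] = column[selection.get(i)];
        }
        return out;
    }

    public static void main(String[] args) {
        String[] key   = {"c1", "c9", "c2", "c1"};
        int[]    value = { 10,   20,   30,   40 };
        List<Integer> sel = select(key, Set.of("c1", "c2"));
        System.out.println(Arrays.toString(take(value, sel))); // [10, 30, 40]
    }
}
```

Deferring materialization this way means each column is copied once per join instead of twice, which is where the ~8 seconds above was going.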
@lwhite1 Let me know if you have any thoughts. Also let me know if you want me to submit my performance tests.