tablesaw: Joins are too slow

Hi guys! I’m trying to migrate from Python + pandas to Kotlin + Tablesaw. Some parts of my code are already fast (CSV parsing, for example, is about 2x faster than in pandas), but I’ve also noticed that the inner join operation is pretty slow (about 2x slower than in pandas).

I then tried to optimize my code by using an isIn Selection instead of a plain join. Unfortunately, isIn converts its input collection with strings.toArray(new String[0]) under the hood, so each lookup scans an array. It would make more sense to use a HashSet for faster lookups, so I wrote my own predicate:

val customersSet = customers.toHashSet()    // HashSet gives O(1) lookups instead of scanning an array
val idColumn = transactions.stringColumn("customer_id")
val matches = transactions.where(idColumn.eval { it in customersSet })

This is about 15x faster than the original inner join, at least on my large dataset. Of course it’s doing much less work than a full join, since I haven’t appended columns etc., but the difference is still huge. I haven’t investigated the join code yet, but I hope there is room for improvement there. My key point is that Kotlin on the JVM should be at least as fast as Python + pandas. What do you guys think?
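For reference, the built-in operation this replaces would be something like the following (customersTable here is a hypothetical Tablesaw table holding the same customer ids). Note it also appends the matching customer columns rather than just filtering:

// built-in inner join: matches rows and appends the matching customer columns
val joined = transactions.joinOn("customer_id").inner(customersTable)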

ps: do you use hash indexing on table columns?

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 17

Most upvoted comments

I didn’t get this finished. It’s more complex than it first seemed. These notes are here for the next attempt.

There were two things contributing to slow joins.

  1. Indexes are recreated in a loop, once for each row in the main join table.
  2. The running time of the cross-product step seems to be sensitive to the number of times it is called, and it is called once for each row in the main join table.

The second issue accounts for most of the time in the join.

I put through a fix for the first issue, which was to pre-create the indexes so they’re only created once per join. The performance is now independent of which table goes first.
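As a rough sketch of the shape of that fix (buildIndex and emitMatches are hypothetical stand-ins, not Tablesaw’s actual methods):

// Before: an index over the other table was rebuilt on every iteration,
// i.e. once per row of the main join table.
for (row in mainTable) {
    val index = buildIndex(otherTable, joinColumn)   // recreated each time
    emitMatches(row, index)
}

// After: the index is pre-created once and reused for the whole join.
val index = buildIndex(otherTable, joinColumn)
for (row in mainTable) {
    emitMatches(row, index)
}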

While it sounds easy to re-swap the tables in the join result, it’s difficult to make the result come out the same as it would have if the tables were handled in the order specified. The row order in the result is different, but I think that is an acceptable issue. It does seem useful, however, to get the column ordering and naming to be the same. Achieving that is made difficult by interactions between other already-implemented features:

  • handling multi-table joins means the join logic is called recursively when more than two tables are involved.
  • handling tables with duplicate column names means that some columns in the result have different names than they had in the source tables. The renaming algorithm assigns a number to each table and prepends T[number]. to each duplicate column name (see the sketch after this list); it relies on the recursion to increment the table number for multi-table joins.
  • removing one of the join columns, to avoid having two copies of the same data in the result, means the joined table doesn’t have all the columns of the original main table.
  • handling multi-column joins means there can be an arbitrary number of such missing columns in the result table.
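To make the renaming rule concrete, here is a standalone sketch of the rule as described (disambiguate is a hypothetical helper, not the actual join code):

// Each source table gets a number; a duplicate column name is
// disambiguated by prepending "T[number]." to it.
fun disambiguate(tableNumber: Int, name: String, seen: MutableSet<String>): String =
    if (seen.add(name)) name else "T$tableNumber.$name".also { seen.add(it) }

fun main() {
    val seen = mutableSetOf<String>()
    val resultColumns =
        listOf("customerId", "total").map { disambiguate(1, it, seen) } +  // table 1
        listOf("total").map { disambiguate(2, it, seen) }                  // table 2
    println(resultColumns)   // [customerId, total, T2.total]
}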

Some of these are easier to deal with than others, but together they make this a non-trivial fix.

A left join with a schema similar to the one laid out above by @deviant-studio now runs in about 300 ms with PR #562.

I would like to work on improving join performance.

I ran some performance tests with the following schema:

  • A customers table with 1,000 rows. (PK customerId)
  • An orders table with 1,000,000 rows. customerId is a FK that is uniformly distributed among the 1,000 customers.
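For anyone who wants to reproduce this, here is a minimal sketch of how such a setup might look. It assumes Tablesaw’s Table.create and joinOn APIs; the column contents beyond customerId and the timing code are illustrative, not the actual test harness:

import tech.tablesaw.api.IntColumn
import tech.tablesaw.api.StringColumn
import tech.tablesaw.api.Table
import kotlin.random.Random

fun main() {
    val rng = Random(42)

    // customers: 1,000 rows, primary key customerId
    val customerIds = IntColumn.create("customerId")
    val names = StringColumn.create("name")
    repeat(1_000) { i ->
        customerIds.append(i)
        names.append("customer-$i")
    }
    val customers = Table.create("customers", customerIds, names)

    // orders: 1,000,000 rows; customerId is a foreign key uniformly
    // distributed among the 1,000 customers
    val orderCustomerIds = IntColumn.create("customerId")
    val amounts = IntColumn.create("amount")
    repeat(1_000_000) {
        orderCustomerIds.append(rng.nextInt(1_000))
        amounts.append(rng.nextInt(500))
    }
    val orders = Table.create("orders", orderCustomerIds, amounts)

    // the slow case: the large table on the left side of the join
    val start = System.nanoTime()
    val joined = orders.joinOn("customerId").inner(customers)
    println("${joined.rowCount()} rows in ${(System.nanoTime() - start) / 1_000_000} ms")
}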

I found one big culprit (especially when the large table is on the left side of the join):

Table table2Rows = table2.where(rowBitMapMultiCol);

Under the covers table2.where is copying all the columns in the table. This can get very slow with large numbers of columns. All the columns are then copied for a second time while calculating the cross product.
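As a hypothetical sketch (not Tablesaw’s actual implementation) of why that line is expensive when it runs repeatedly inside the join:

// A Selection-based filter materializes every column into a fresh copy.
// Done once per join, that's fine; done once per probe of the large
// table, the copying dominates the total running time.
fun <T> filterColumns(columns: List<List<T>>, selectedRows: IntArray): List<List<T>> =
    columns.map { column -> selectedRows.map { row -> column[row] } }

Batching the selected row indices and filtering once at the end is one way to avoid the repeated copies.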

In one test, where the orders table (the large table) was on the left side of the join and the customers table had six columns, the join took about 10 seconds to complete, of which about 8 seconds were spent in Table table2Rows = table2.where(rowBitMapMultiCol);.

I am going to work on a PR that will:

  • Speed up joins in the worst case.
  • Make the join speed less sensitive to which table is first (I was seeing differences of 20x).
  • Reduce unnecessary copying of columns.
  • Change the join algorithm rather than fiddling with the column renaming and reordering logic.

@lwhite1 Let me know if you have any thoughts. Also let me know if you want me to submit my performance tests.