velox: HashAgg w/o spill doesn't outperform vanilla Spark

Description

While profiling a customer workload, we found that HashAgg cannot outperform vanilla Spark, even when no spill happens. In this query the group-by keys are the primary keys of the table, so the aggregation produces no reduction at all (the 28,319,541 output rows match hashtable.numDistinct). Here are the detailed plan stats:

    -- Aggregation[FINAL [n0_0, n0_1] n1_2 := sum("n0_2"), n1_3 := sum("n0_3"), n1_4 := sum("n0_4"), n1_5 := sum("n0_5"), n1_6 := sum("n0_6"), n1_7 := sum("n0_7"), n1_8 := sum("n0_8"), n1_9 := sum("n0_9"), n1_10 := sum("n0_10")] -> n0_0:VARCHAR, n0_1:VARCHAR, n1_2:BIGINT, n1_3:BIGINT, n1_4:DOUBLE, n1_5:BIGINT, n1_6:DOUBLE, n1_7:BIGINT, n1_8:DOUBLE, n1_9:BIGINT, n1_10:DOUBLE
       Output: 28319541 rows (6.01GB, 6914 batches), Cpu time: 37.48s, Blocked wall time: 0ns, Peak memory: 5.05GB, Memory allocations: 207459, Threads: 1
          distinctKey0                 sum: 1, count: 1, min: 1, max: 1
          distinctKey1                 sum: 32250, count: 1, min: 32250, max: 32250
          hashtable.capacity           sum: 33554432, count: 1, min: 33554432, max: 33554432
          hashtable.numDistinct        sum: 28319541, count: 1, min: 28319541, max: 28319541
          hashtable.numRehashes        sum: 14, count: 1, min: 14, max: 14
          hashtable.numTombstones      sum: 0, count: 1, min: 0, max: 0
          runningAddInputWallNanos     sum: 34.32s, count: 1, min: 34.32s, max: 34.32s
          runningFinishWallNanos       sum: 1.62us, count: 1, min: 1.62us, max: 1.62us
          runningGetOutputWallNanos    sum: 3.15s, count: 1, min: 3.15s, max: 3.15s

ping @FelixYBW @mbasmanova.
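
For anyone who wants to reproduce this shape outside of Spark, here is a minimal sketch of the same aggregation: two varchar grouping keys that are effectively unique per row plus nine sum aggregates, mirroring the FINAL aggregation above. The column names and the use of Velox's test-utility PlanBuilder are assumptions for illustration, not taken from the customer query.

    // Hypothetical repro sketch (not the customer plan): group by two
    // ~32-byte varchar keys that are unique per row, so the hash table
    // ends up with one group per input row and produces no reduction.
    #include "velox/exec/tests/utils/PlanBuilder.h"
    #include "velox/vector/ComplexVector.h"

    using namespace facebook::velox;

    core::PlanNodePtr makeAggregationPlan(const RowVectorPtr& data) {
      return exec::test::PlanBuilder()
          .values({data})
          .singleAggregation(
              {"k0", "k1"}, // varchar primary-key columns (assumed names)
              {"sum(c2)", "sum(c3)", "sum(c4)", "sum(c5)", "sum(c6)",
               "sum(c7)", "sum(c8)", "sum(c9)", "sum(c10)"})
          .planNode();
    }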

About this issue

  • State: open
  • Created 8 months ago
  • Comments: 29 (18 by maintainers)

Most upvoted comments

Are you missing -mavx2 in your build?
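
(For anyone hitting the same build issue, here is a generic compile-time check, not Velox-specific, to confirm that -mavx2 actually reached the compiler; GCC and Clang only define __AVX2__ when the flag, or an -march that implies it, is set.)

    // Fails to compile unless built with -mavx2 (or e.g. -march=native).
    #include <immintrin.h>

    #ifndef __AVX2__
    #error "AVX2 is not enabled; add -mavx2 to the CXX flags"
    #endif

    int main() {
      __m256i zero = _mm256_setzero_si256(); // trivial AVX2 intrinsic
      (void)zero;
      return 0;
    }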

Thanks @Yuhta, I was able to build successfully after adding this flag. @mbasmanova, #7386 gives no obvious improvement in my workload. FYI.

@yma11 Thank you for the update. Happy to hear that we are no longer slower than Spark and are now a tiny bit faster.

I also noticed that https://github.com/facebookincubator/velox/pull/7261 was reverted. We’ll need to work on getting it back in. CC: @Yuhta @oerling

@oerling @yma11 Discussed with Orri. He is going to add a fast path to HashStringAllocator to copy a single string.
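
To illustrate the kind of shortcut being discussed (a simplified sketch only, not Velox's actual HashStringAllocator API or the planned change): copying one short string can be a single bump-pointer allocation plus one memcpy, skipping any general multi-block write path.

    // Illustrative arena with a "copy one string contiguously" fast path.
    #include <algorithm>
    #include <cstddef>
    #include <cstring>
    #include <string_view>
    #include <vector>

    class SimpleArena {
     public:
      // Fast path: one capacity check, one memcpy, no continuation headers.
      const char* copyString(std::string_view s) {
        if (s.empty()) {
          return "";
        }
        if (s.size() > remaining_) {
          newBlock(s.size());
        }
        char* dst = cursor_;
        std::memcpy(dst, s.data(), s.size());
        cursor_ += s.size();
        remaining_ -= s.size();
        return dst;
      }

     private:
      void newBlock(size_t atLeast) {
        const size_t size = std::max<size_t>(atLeast, 4096);
        blocks_.emplace_back(size);
        cursor_ = blocks_.back().data();
        remaining_ = size;
      }

      std::vector<std::vector<char>> blocks_;
      char* cursor_ = nullptr;
      size_t remaining_ = 0;
    };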

@yma11 Any chance you could try applying #7261 to see if it helps?

@mbasmanova I applied this patch and the E2E time of the query dropped from 56s to 40s, so this change helps. But 40s is still not good enough, and the remaining gap may be caused by other code changes. I am now switching to the latest code for a deeper investigation, so I will keep this ticket open until we reach a good result.

The number of aggregations doesn’t matter much; having multiple string keys has a much bigger impact.

Can you also clarify the lengths of the strings? Are these roughly 80 bytes each?

They are not that long, about 32 bytes each, like 256d12eb37ee1fed458dce8cfd6b67f1.
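
A quick back-of-envelope using the plan stats above (illustrative only): with two ~32-byte varchar keys across 28,319,541 distinct groups, the hash table has to copy roughly 1.8 GB of key payload, before counting headers, padding, or the 33,554,432-slot table itself, which is consistent with the observation that the multiple string keys dominate the cost.

    // Rough estimate of the string-key payload the hash table must store.
    #include <cstdint>
    #include <cstdio>

    int main() {
      constexpr int64_t kNumDistinct = 28'319'541; // hashtable.numDistinct
      constexpr int64_t kKeysPerRow = 2;           // two varchar grouping keys
      constexpr int64_t kAvgKeyBytes = 32;         // ~32-byte keys per the thread
      constexpr int64_t kKeyBytes = kNumDistinct * kKeysPerRow * kAvgKeyBytes;
      std::printf("~%.2f GB of key payload\n", kKeyBytes / 1e9);
      return 0;
    }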