velox: HashAgg w/o spill doesn't outperform vanilla Spark

Description

In a customer workload profiling, we found hashAgg can’t outperform vanilla Spark, even in the case that no spill happens. In this query, the groupby keys are the primary keys of the table so aggregation doesn’t produce reduction at all. Here is the detailed plan stats:

    -- Aggregation[FINAL [n0_0, n0_1] n1_2 := sum("n0_2"), n1_3 := sum("n0_3"), n1_4 := sum("n0_4"), n1_5 := sum("n0_5"), n1_6 := sum("n0_6"), n1_7 := sum("n0_7"), n1_8 := sum("n0_8"), n1_9 := sum("n0_9"), n1_10 := sum("n0_10")] -> n0_0:VARCHAR, n0_1:VARCHAR, n1_2:BIGINT, n1_3:BIGINT, n1_4:DOUBLE, n1_5:BIGINT, n1_6:DOUBLE, n1_7:BIGINT, n1_8:DOUBLE, n1_9:BIGINT, n1_10:DOUBLE
       Output: 28319541 rows (6.01GB, 6914 batches), Cpu time: 37.48s, Blocked wall time: 0ns, Peak memory: 5.05GB, Memory allocations: 207459, Threads: 1
          distinctKey0                 sum: 1, count: 1, min: 1, max: 1
          distinctKey1                 sum: 32250, count: 1, min: 32250, max: 32250
          hashtable.capacity           sum: 33554432, count: 1, min: 33554432, max: 33554432
          hashtable.numDistinct        sum: 28319541, count: 1, min: 28319541, max: 28319541
          hashtable.numRehashes        sum: 14, count: 1, min: 14, max: 14
          hashtable.numTombstones      sum: 0, count: 1, min: 0, max: 0
          runningAddInputWallNanos     sum: 34.32s, count: 1, min: 34.32s, max: 34.32s
          runningFinishWallNanos       sum: 1.62us, count: 1, min: 1.62us, max: 1.62us
          runningGetOutputWallNanos    sum: 3.15s, count: 1, min: 3.15s, max: 3.15s

ping @FelixYBW @mbasmanova.

About this issue

Original URL
State: open
Created 8 months ago
Comments: 29 (18 by maintainers)

Commits related to this issue

Faster StringView hash (#7261) Summary: Replaces folly spookyHash with a combination of CRC32 intrinnsics. Most X86 implementations can run 3 data independent CRC32 instructions at a time. The loop ... — committed to facebookincubator/velox by deleted user 8 months ago

Most upvoted comments

Are you missing -mavx2 in your build?

Thanks @Yuhta, I got to build successfully after adding this flag. @mbasmanova 7386 has no obvious help in my workload. FYI.

yma11 on Nov 8, 2023

@yma11 Thank you for the update. Happy to hear that we are no longer slower than Spark and a tiny bit faster.

I also noticed that https://github.com/facebookincubator/velox/pull/7261 was reverted. We’ll need to work on getting it back in. CC: @Yuhta @oerling

mbasmanova on Nov 2, 2023

@oerling @yma11 Discussed with Orri. He is going to add a fast path to HashStringAllocator to copy a single string.

mbasmanova on Oct 31, 2023

@yma11 Any chance you could you try to apply #7261 to see if it helps?

@mbasmanova I applied this patch and E2E of the query is shortened to 40s from 56s. So this change helps. But 40s is still not good enough and maybe caused by other code changes. Now I am switching to latest code and make deeper investigation. So will keep this ticket open util we figure out a good result.

yma11 on Oct 30, 2023

The aggregation number doesn’t quite matter. Multi string keys affect much.

yma11 on Oct 26, 2023

can you also clarify what are the lengths of the strings? Are these roughly 80 bytes each? Seems it’s not so long but 32 bytes, like 256d12eb37ee1fed458dce8cfd6b67f1.

yma11 on Oct 26, 2023