velox: HashAgg w/o spill doesn't outperform vanilla Spark
Description
While profiling a customer workload, we found that HashAgg can't outperform vanilla Spark, even when no spill happens. In this query, the group-by keys are the primary keys of the table, so the aggregation produces no reduction at all. Here are the detailed plan stats:
-- Aggregation[FINAL [n0_0, n0_1] n1_2 := sum("n0_2"), n1_3 := sum("n0_3"), n1_4 := sum("n0_4"), n1_5 := sum("n0_5"), n1_6 := sum("n0_6"), n1_7 := sum("n0_7"), n1_8 := sum("n0_8"), n1_9 := sum("n0_9"), n1_10 := sum("n0_10")] -> n0_0:VARCHAR, n0_1:VARCHAR, n1_2:BIGINT, n1_3:BIGINT, n1_4:DOUBLE, n1_5:BIGINT, n1_6:DOUBLE, n1_7:BIGINT, n1_8:DOUBLE, n1_9:BIGINT, n1_10:DOUBLE
Output: 28319541 rows (6.01GB, 6914 batches), Cpu time: 37.48s, Blocked wall time: 0ns, Peak memory: 5.05GB, Memory allocations: 207459, Threads: 1
distinctKey0 sum: 1, count: 1, min: 1, max: 1
distinctKey1 sum: 32250, count: 1, min: 32250, max: 32250
hashtable.capacity sum: 33554432, count: 1, min: 33554432, max: 33554432
hashtable.numDistinct sum: 28319541, count: 1, min: 28319541, max: 28319541
hashtable.numRehashes sum: 14, count: 1, min: 14, max: 14
hashtable.numTombstones sum: 0, count: 1, min: 0, max: 0
runningAddInputWallNanos sum: 34.32s, count: 1, min: 34.32s, max: 34.32s
runningFinishWallNanos sum: 1.62us, count: 1, min: 1.62us, max: 1.62us
runningGetOutputWallNanos sum: 3.15s, count: 1, min: 3.15s, max: 3.15s
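For reference, here is a rough sketch of what the stats above imply, written in Python. The variable names are mine and are not Velox identifiers. It assumes that `hashtable.capacity` counts hash table slots and that the number of input rows equals the number of output rows (the stats do not report the input row count, but the keys are primary keys, so there should be no reduction).

```python
# Figures copied from the plan stats above.
num_distinct = 28_319_541      # hashtable.numDistinct (also the output row count)
capacity = 33_554_432          # hashtable.capacity
num_batches = 6_914            # output batches
cpu_time_s = 37.48             # operator CPU time
add_input_s = 34.32            # runningAddInputWallNanos
get_output_s = 3.15            # runningGetOutputWallNanos

# Assumption: capacity is the slot count, so this is the table's fill ratio.
load_factor = num_distinct / capacity

# Assumption: input rows == output rows (no reduction), so dividing by
# num_distinct gives the addInput cost per input row.
add_input_ns_per_row = add_input_s * 1e9 / num_distinct

# Share of the operator CPU time spent inserting rows into the hash table.
add_input_share = add_input_s / cpu_time_s

# Average number of rows per output batch.
rows_per_batch = num_distinct / num_batches

print(f"load factor:          {load_factor:.3f}")
print(f"addInput ns per row:  {add_input_ns_per_row:.0f}")
print(f"addInput share:       {add_input_share:.1%}")
print(f"rows per batch:       {rows_per_batch:.0f}")
```

With these numbers, the table is about 84% full, addInput accounts for over 90% of the operator CPU time, and each row costs roughly 1.2us to insert, so the time is going into hash table insertion rather than into producing output.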
ping @FelixYBW @mbasmanova.
About this issue
- State: open
- Created 8 months ago
- Comments: 29 (18 by maintainers)
Thanks @Yuhta, the build succeeded after adding this flag. @mbasmanova FYI, 7386 does not noticeably help in my workload.
@yma11 Thank you for the update. Happy to hear that we are no longer slower than Spark and are even a tiny bit faster.
I also noticed that https://github.com/facebookincubator/velox/pull/7261 was reverted. We’ll need to work on getting it back in. CC: @Yuhta @oerling
@oerling @yma11 Discussed with Orri. He is going to add a fast path to HashStringAllocator to copy a single string.
@mbasmanova I applied this patch and the E2E time of the query dropped from 56s to 40s, so this change helps. But 40s is still not good enough, and the remaining gap may be caused by other code changes. I am now switching to the latest code to investigate further, so I will keep this ticket open until we reach a good result.
The number of aggregations doesn't matter much. Multiple string keys have a large effect.