ClickHouse: Nearly 20% decompression performance degradation
I'm working on the ClickHouse-based backend for Project Gluten and recently saw a significant performance degradation in the new version of ClickHouse after merging the 22.3 branch.
Performance test of LZ4 decompression
There are three experimental scenarios, all of which use the same data set: the LineItem table of TPCH 10, row count 60,000,000. Columns:
- l_quantity
- l_extendedprice
- l_discount
- l_shipdate
The experimental scenarios:
1. Decompression performance test: using the implementation in utils/compressor, the LZ4 decompression variants 0, 1, 2 and 3 are tested.
2. MergeTree read test: build a plan with only ReadFromMergeTree in code.
3. TPCH Q6 test: build the QueryPlan of TPCH Q6 in code and run it.
Summary First
22.3 shows a performance degradation of over 10% compared to an earlier version such as 21.4, and the current testing shows that the degradation comes from the LZ4 decompression algorithm. From reading the code, the new changes were made to address some security issues. The logic involved is not complex, yet it causes a significant performance degradation.
22.3 remove bound check
code of decompressImpl:
template <size_t copy_amount, bool use_shuffle>
bool NO_INLINE decompressImpl(
    const char * const source,
    char * const dest,
    size_t source_size,
    size_t dest_size)
{
    const UInt8 * ip = reinterpret_cast<const UInt8 *>(source);
    UInt8 * op = reinterpret_cast<UInt8 *>(dest);
    const UInt8 * const input_end = ip + source_size;
    UInt8 * const output_begin = op;
    UInt8 * const output_end = op + dest_size;

    /// Unrolling with clang is doing >10% performance degrade.
#if defined(__clang__)
    #pragma nounroll
#endif
    while (true)
    {
        size_t length;

        auto continue_read_length = [&]
        {
            unsigned s;
            do
            {
                s = *ip++;
                length += s;
            } while (unlikely(s == 255 && ip < input_end));
        };

        /// Get literal length.
        const unsigned token = *ip++;
        length = token >> 4;
        if (length == 0x0F)
            continue_read_length();

        /// Copy literals.
        UInt8 * copy_end = op + length;

        /// input:  Hello, world
        ///         ^-ip
        /// output: xyz
        ///            ^-op ^-copy_end
        /// output: xyzHello, w
        ///                 ^- excessive copied bytes due to "wildCopy"
        /// input:  Hello, world
        ///              ^-ip
        /// output: xyzHello, w
        ///                 ^-op (we will overwrite excessive bytes on next iteration)

        if (unlikely(copy_end > output_end))
            return false;

        // Due to implementation specifics the copy length is always a multiple of copy_amount
        size_t real_length = 0;

        static_assert(copy_amount == 8 || copy_amount == 16 || copy_amount == 32);
        if constexpr (copy_amount == 8)
            real_length = (((length >> 3) + 1) * 8);
        else if constexpr (copy_amount == 16)
            real_length = (((length >> 4) + 1) * 16);
        else if constexpr (copy_amount == 32)
            real_length = (((length >> 5) + 1) * 32);

        if (unlikely(ip + real_length >= input_end + ADDITIONAL_BYTES_AT_END_OF_BUFFER))
            return false;

        wildCopy<copy_amount>(op, ip, copy_end); /// Here we can write up to copy_amount - 1 bytes after buffer.

        if (copy_end == output_end)
            return true;

        ip += length;
        op = copy_end;

        if (copy_end >= output_end)
            return true;

        /// Get match offset.
        size_t offset = unalignedLoad<UInt16>(ip);
        ip += 2;
        const UInt8 * match = op - offset;

        /// Get match length.
        length = token & 0x0F;
        if (length == 0x0F)
            continue_read_length();
        length += 4;

        /// Copy match within block, that produce overlapping pattern. Match may replicate itself.
        copy_end = op + length;

        /** Here we can write up to copy_amount - 1 - 4 * 2 bytes after buffer.
          * The worst case when offset = 1 and length = 4
          */
        if (unlikely(offset < copy_amount))
        {
            /// output: Hello
            ///              ^-op
            ///         ^-match; offset = 5
            ///
            /// output: Hello
            ///         [------] - copy_amount bytes
            ///              [------] - copy them here
            ///
            /// output: HelloHelloHel
            ///            ^-match   ^-op

            copyOverlap<copy_amount, use_shuffle>(op, match, offset);
        }
        else
        {
            copy<copy_amount>(op, match);
            match += copy_amount;
        }

        op += copy_amount;

        copy<copy_amount>(op, match); /// copy_amount + copy_amount - 1 - 4 * 2 bytes after buffer.
        if (length > copy_amount * 2)
            wildCopy<copy_amount>(op + copy_amount, match + copy_amount, copy_end);

        op = copy_end;
    }
}
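To make the literal-copy step of the listing above easier to reason about, here is a minimal, self-contained sketch of just that step for copy_amount == 8. The names roundedCopyLength and wildCopy8 are illustrative stand-ins written for this note, not the ClickHouse functions. The rounding formula is taken from the listing, and the chunked loop is an assumption about how a "wild" copy typically behaves.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

/// Upper bound on the bytes read by a chunked copy of `length` bytes,
/// using the same formula as the listing: ((length / N) + 1) * N.
template <size_t copy_amount>
constexpr size_t roundedCopyLength(size_t length)
{
    static_assert(copy_amount == 8 || copy_amount == 16 || copy_amount == 32);
    return (length / copy_amount + 1) * copy_amount;
}

/// Hypothetical stand-in for wildCopy<8>: copies whole 8-byte chunks until
/// dst reaches dst_end, so it may write up to 7 bytes past dst_end.
/// Both buffers must therefore have spare room after their logical end.
inline void wildCopy8(uint8_t * dst, const uint8_t * src, const uint8_t * dst_end)
{
    do
    {
        std::memcpy(dst, src, 8);
        dst += 8;
        src += 8;
    } while (dst < dst_end);
}
```

For example, copying 5 literal bytes with wildCopy8 touches 8 bytes of both buffers, and roundedCopyLength<8>(5) is 8. For a length that is an exact multiple, such as 8, the loop in this sketch touches 8 bytes while the formula yields 16, so the formula acts as a conservative bound. The extra comparison built from it runs once per LZ4 sequence, which is the hot path being measured.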
decompress performance
! /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T11:52:16+08:00
Running /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 1.20, 0.75, 0.67
---------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------------------
BM_TestDecompress/0/iterations:50/repeats:6 551 ms 551 ms 50
BM_TestDecompress/0/iterations:50/repeats:6 549 ms 549 ms 50
BM_TestDecompress/0/iterations:50/repeats:6 549 ms 549 ms 50
BM_TestDecompress/0/iterations:50/repeats:6 552 ms 552 ms 50
BM_TestDecompress/0/iterations:50/repeats:6 553 ms 553 ms 50
BM_TestDecompress/0/iterations:50/repeats:6 547 ms 547 ms 50
BM_TestDecompress/0/iterations:50/repeats:6_mean 550 ms 550 ms 6
BM_TestDecompress/0/iterations:50/repeats:6_median 550 ms 550 ms 6
BM_TestDecompress/0/iterations:50/repeats:6_stddev 2.14 ms 2.14 ms 6
BM_TestDecompress/0/iterations:50/repeats:6_80% 553 ms 553 ms 6
BM_TestDecompress/1/iterations:50/repeats:6 667 ms 667 ms 50
BM_TestDecompress/1/iterations:50/repeats:6 694 ms 694 ms 50
BM_TestDecompress/1/iterations:50/repeats:6 668 ms 668 ms 50
BM_TestDecompress/1/iterations:50/repeats:6 670 ms 670 ms 50
BM_TestDecompress/1/iterations:50/repeats:6 677 ms 677 ms 50
BM_TestDecompress/1/iterations:50/repeats:6 670 ms 670 ms 50
BM_TestDecompress/1/iterations:50/repeats:6_mean 674 ms 674 ms 6
BM_TestDecompress/1/iterations:50/repeats:6_median 670 ms 670 ms 6
BM_TestDecompress/1/iterations:50/repeats:6_stddev 10.4 ms 10.4 ms 6
BM_TestDecompress/1/iterations:50/repeats:6_80% 677 ms 677 ms 6
BM_TestDecompress/2/iterations:50/repeats:6 680 ms 680 ms 50
BM_TestDecompress/2/iterations:50/repeats:6 693 ms 693 ms 50
BM_TestDecompress/2/iterations:50/repeats:6 731 ms 731 ms 50
BM_TestDecompress/2/iterations:50/repeats:6 699 ms 699 ms 50
BM_TestDecompress/2/iterations:50/repeats:6 688 ms 688 ms 50
BM_TestDecompress/2/iterations:50/repeats:6 684 ms 684 ms 50
BM_TestDecompress/2/iterations:50/repeats:6_mean 696 ms 696 ms 6
BM_TestDecompress/2/iterations:50/repeats:6_median 691 ms 691 ms 6
BM_TestDecompress/2/iterations:50/repeats:6_stddev 18.7 ms 18.7 ms 6
BM_TestDecompress/2/iterations:50/repeats:6_80% 688 ms 688 ms 6
BM_TestDecompress/3/iterations:50/repeats:6 813 ms 813 ms 50
BM_TestDecompress/3/iterations:50/repeats:6 817 ms 817 ms 50
BM_TestDecompress/3/iterations:50/repeats:6 813 ms 813 ms 50
BM_TestDecompress/3/iterations:50/repeats:6 815 ms 815 ms 50
BM_TestDecompress/3/iterations:50/repeats:6 816 ms 816 ms 50
BM_TestDecompress/3/iterations:50/repeats:6 816 ms 816 ms 50
BM_TestDecompress/3/iterations:50/repeats:6_mean 815 ms 815 ms 6
BM_TestDecompress/3/iterations:50/repeats:6_median 815 ms 815 ms 6
BM_TestDecompress/3/iterations:50/repeats:6_stddev 1.55 ms 1.55 ms 6
BM_TestDecompress/3/iterations:50/repeats:6_80% 816 ms 816 ms 6
MergeTree Read Performance
! /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T11:36:00+08:00
Running /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 0.17, 0.70, 0.86
--------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------------------
BM_MergeTreeRead/2/iterations:50/repeats:6 499 ms 499 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6 499 ms 499 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6 505 ms 505 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6 498 ms 498 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6 500 ms 500 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6 498 ms 498 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6_mean 500 ms 500 ms 6
BM_MergeTreeRead/2/iterations:50/repeats:6_median 499 ms 499 ms 6
BM_MergeTreeRead/2/iterations:50/repeats:6_stddev 2.68 ms 2.68 ms 6
BM_MergeTreeRead/2/iterations:50/repeats:6_80% 500 ms 500 ms 6
TPCH Q6 performance
! /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T11:27:40+08:00
Running /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 1.99, 1.15, 0.94
-----------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------------------------
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1149 ms 1149 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1144 ms 1144 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1143 ms 1143 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1140 ms 1140 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1142 ms 1142 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1136 ms 1136 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_mean 1142 ms 1142 ms 6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_median 1143 ms 1143 ms 6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_stddev 4.34 ms 4.34 ms 6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_80% 1142 ms 1142 ms 6
22.3 Revert ALL
code of decompressImpl:
template <size_t copy_amount, bool use_shuffle>
bool NO_INLINE decompressImpl(
    const char * const source,
    char * const dest,
    size_t source_size,
    size_t dest_size)
{
    const UInt8 * ip = reinterpret_cast<const UInt8 *>(source);
    UInt8 * op = reinterpret_cast<UInt8 *>(dest);
    const UInt8 * const input_end = ip + source_size;
    UInt8 * const output_begin = op;
    UInt8 * const output_end = op + dest_size;

    /// Unrolling with clang is doing >10% performance degrade.
#if defined(__clang__)
    #pragma nounroll
#endif
    while (true)
    {
        size_t length;

        auto continue_read_length = [&]
        {
            unsigned s;
            do
            {
                s = *ip++;
                length += s;
            } while (unlikely(s == 255));
        };

        /// Get literal length.
        const unsigned token = *ip++;
        length = token >> 4;
        if (length == 0x0F)
            continue_read_length();

        /// Copy literals.
        UInt8 * copy_end = op + length;

        /// input:  Hello, world
        ///         ^-ip
        /// output: xyz
        ///            ^-op ^-copy_end
        /// output: xyzHello, w
        ///                 ^- excessive copied bytes due to "wildCopy"
        /// input:  Hello, world
        ///              ^-ip
        /// output: xyzHello, w
        ///                 ^-op (we will overwrite excessive bytes on next iteration)

        wildCopy<copy_amount>(op, ip, copy_end); /// Here we can write up to copy_amount - 1 bytes after buffer.

        ip += length;
        op = copy_end;

        if (copy_end >= output_end)
            return true;

        /// Get match offset.
        size_t offset = unalignedLoad<UInt16>(ip);
        ip += 2;
        const UInt8 * match = op - offset;

        /// Get match length.
        length = token & 0x0F;
        if (length == 0x0F)
            continue_read_length();
        length += 4;

        /// Copy match within block, that produce overlapping pattern. Match may replicate itself.
        copy_end = op + length;

        /** Here we can write up to copy_amount - 1 - 4 * 2 bytes after buffer.
          * The worst case when offset = 1 and length = 4
          */
        if (unlikely(offset < copy_amount))
        {
            /// output: Hello
            ///              ^-op
            ///         ^-match; offset = 5
            ///
            /// output: Hello
            ///         [------] - copy_amount bytes
            ///              [------] - copy them here
            ///
            /// output: HelloHelloHel
            ///            ^-match   ^-op

            copyOverlap<copy_amount, use_shuffle>(op, match, offset);
        }
        else
        {
            copy<copy_amount>(op, match);
            match += copy_amount;
        }

        op += copy_amount;

        copy<copy_amount>(op, match); /// copy_amount + copy_amount - 1 - 4 * 2 bytes after buffer.
        if (length > copy_amount * 2)
            wildCopy<copy_amount>(op + copy_amount, match + copy_amount, copy_end);

        op = copy_end;
    }
}
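One visible difference between this listing and the one under "22.3 remove bound check" is the loop condition in continue_read_length. As a rough illustration, the two variants can be written as free functions over a plain byte buffer. The names readLengthUnchecked and readLengthChecked are hypothetical and exist only in this sketch, and the unlikely() hint is omitted to keep it standalone.

```cpp
#include <cstddef>
#include <cstdint>

/// LZ4 stores a length >= 15 as the nibble 15 followed by extra bytes:
/// each extra byte is added to the length, and a byte below 255 ends the run.

/// Variant without the bound check, as in the listing under "22.3 Revert ALL".
/// It trusts the input to contain a terminating byte below 255.
inline const uint8_t * readLengthUnchecked(const uint8_t * ip, size_t & length)
{
    unsigned s;
    do
    {
        s = *ip++;
        length += s;
    } while (s == 255);
    return ip;
}

/// Variant with the bound check, as in the listing under "22.3 remove bound check".
/// It stops once ip reaches input_end, even if the last byte read was 255.
inline const uint8_t * readLengthChecked(const uint8_t * ip, const uint8_t * input_end, size_t & length)
{
    unsigned s;
    do
    {
        s = *ip++;
        length += s;
    } while (s == 255 && ip < input_end);
    return ip;
}
```

With the bytes 255, 255, 10 and an initial length of 15, both variants return a length of 535 and advance ip by three bytes. They differ only on truncated input, where the checked variant stops at input_end instead of reading past the buffer.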
decompress performance
! /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T13:08:07+08:00
Running /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 0.16, 0.36, 0.61
---------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------------------
BM_TestDecompress/0/iterations:50/repeats:6 518 ms 518 ms 50
BM_TestDecompress/0/iterations:50/repeats:6 515 ms 515 ms 50
BM_TestDecompress/0/iterations:50/repeats:6 516 ms 516 ms 50
BM_TestDecompress/0/iterations:50/repeats:6 520 ms 520 ms 50
BM_TestDecompress/0/iterations:50/repeats:6 524 ms 524 ms 50
BM_TestDecompress/0/iterations:50/repeats:6 516 ms 516 ms 50
BM_TestDecompress/0/iterations:50/repeats:6_mean 518 ms 518 ms 6
BM_TestDecompress/0/iterations:50/repeats:6_median 517 ms 517 ms 6
BM_TestDecompress/0/iterations:50/repeats:6_stddev 3.38 ms 3.38 ms 6
BM_TestDecompress/0/iterations:50/repeats:6_80% 524 ms 524 ms 6
BM_TestDecompress/1/iterations:50/repeats:6 476 ms 476 ms 50
BM_TestDecompress/1/iterations:50/repeats:6 476 ms 476 ms 50
BM_TestDecompress/1/iterations:50/repeats:6 477 ms 477 ms 50
BM_TestDecompress/1/iterations:50/repeats:6 480 ms 480 ms 50
BM_TestDecompress/1/iterations:50/repeats:6 480 ms 480 ms 50
BM_TestDecompress/1/iterations:50/repeats:6 478 ms 478 ms 50
BM_TestDecompress/1/iterations:50/repeats:6_mean 478 ms 478 ms 6
BM_TestDecompress/1/iterations:50/repeats:6_median 477 ms 477 ms 6
BM_TestDecompress/1/iterations:50/repeats:6_stddev 1.81 ms 1.81 ms 6
BM_TestDecompress/1/iterations:50/repeats:6_80% 480 ms 480 ms 6
BM_TestDecompress/2/iterations:50/repeats:6 589 ms 589 ms 50
BM_TestDecompress/2/iterations:50/repeats:6 584 ms 584 ms 50
BM_TestDecompress/2/iterations:50/repeats:6 581 ms 581 ms 50
BM_TestDecompress/2/iterations:50/repeats:6 586 ms 586 ms 50
BM_TestDecompress/2/iterations:50/repeats:6 581 ms 581 ms 50
BM_TestDecompress/2/iterations:50/repeats:6 584 ms 584 ms 50
BM_TestDecompress/2/iterations:50/repeats:6_mean 584 ms 584 ms 6
BM_TestDecompress/2/iterations:50/repeats:6_median 584 ms 584 ms 6
BM_TestDecompress/2/iterations:50/repeats:6_stddev 2.83 ms 2.83 ms 6
BM_TestDecompress/2/iterations:50/repeats:6_80% 581 ms 581 ms 6
BM_TestDecompress/3/iterations:50/repeats:6 636 ms 636 ms 50
BM_TestDecompress/3/iterations:50/repeats:6 635 ms 635 ms 50
BM_TestDecompress/3/iterations:50/repeats:6 634 ms 634 ms 50
BM_TestDecompress/3/iterations:50/repeats:6 636 ms 636 ms 50
BM_TestDecompress/3/iterations:50/repeats:6 635 ms 635 ms 50
BM_TestDecompress/3/iterations:50/repeats:6 632 ms 632 ms 50
BM_TestDecompress/3/iterations:50/repeats:6_mean 635 ms 635 ms 6
BM_TestDecompress/3/iterations:50/repeats:6_median 635 ms 635 ms 6
BM_TestDecompress/3/iterations:50/repeats:6_stddev 1.37 ms 1.37 ms 6
BM_TestDecompress/3/iterations:50/repeats:6_80% 635 ms 635 ms 6
MergeTree Read Performance
! /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T12:58:30+08:00
Running /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 1.91, 1.24, 0.93
--------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------------------
BM_MergeTreeRead/2/iterations:50/repeats:6 445 ms 445 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6 447 ms 447 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6 460 ms 460 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6 462 ms 462 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6 452 ms 452 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6 461 ms 461 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6_mean 454 ms 454 ms 6
BM_MergeTreeRead/2/iterations:50/repeats:6_median 456 ms 456 ms 6
BM_MergeTreeRead/2/iterations:50/repeats:6_stddev 7.57 ms 7.57 ms 6
BM_MergeTreeRead/2/iterations:50/repeats:6_80% 452 ms 452 ms 6
TPCH Q6 performance
! /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T13:47:10+08:00
Running /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 1.54, 1.00, 0.76
-----------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------------------------
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1093 ms 1093 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1078 ms 1078 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1098 ms 1098 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1079 ms 1079 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1069 ms 1069 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1113 ms 1113 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_mean 1088 ms 1088 ms 6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_median 1086 ms 1086 ms 6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_stddev 16.0 ms 16.0 ms 6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_80% 1069 ms 1069 ms 6
22.3 no changes
branch of 22.3
decompress performance
! /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T11:01:35+08:00
Running /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 0.38, 0.19, 0.13
---------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------------------
BM_TestDecompress/0/iterations:50/repeats:6 630 ms 627 ms 50
BM_TestDecompress/0/iterations:50/repeats:6 626 ms 626 ms 50
BM_TestDecompress/0/iterations:50/repeats:6 626 ms 626 ms 50
BM_TestDecompress/0/iterations:50/repeats:6 623 ms 623 ms 50
BM_TestDecompress/0/iterations:50/repeats:6 633 ms 633 ms 50
BM_TestDecompress/0/iterations:50/repeats:6 628 ms 628 ms 50
BM_TestDecompress/0/iterations:50/repeats:6_mean 628 ms 627 ms 6
BM_TestDecompress/0/iterations:50/repeats:6_median 627 ms 627 ms 6
BM_TestDecompress/0/iterations:50/repeats:6_stddev 3.32 ms 3.17 ms 6
BM_TestDecompress/0/iterations:50/repeats:6_80% 633 ms 633 ms 6
BM_TestDecompress/1/iterations:50/repeats:6 580 ms 580 ms 50
BM_TestDecompress/1/iterations:50/repeats:6 577 ms 577 ms 50
BM_TestDecompress/1/iterations:50/repeats:6 581 ms 581 ms 50
BM_TestDecompress/1/iterations:50/repeats:6 578 ms 578 ms 50
BM_TestDecompress/1/iterations:50/repeats:6 579 ms 579 ms 50
BM_TestDecompress/1/iterations:50/repeats:6 578 ms 578 ms 50
BM_TestDecompress/1/iterations:50/repeats:6_mean 579 ms 579 ms 6
BM_TestDecompress/1/iterations:50/repeats:6_median 579 ms 579 ms 6
BM_TestDecompress/1/iterations:50/repeats:6_stddev 1.59 ms 1.59 ms 6
BM_TestDecompress/1/iterations:50/repeats:6_80% 579 ms 579 ms 6
BM_TestDecompress/2/iterations:50/repeats:6 800 ms 800 ms 50
BM_TestDecompress/2/iterations:50/repeats:6 802 ms 802 ms 50
BM_TestDecompress/2/iterations:50/repeats:6 801 ms 801 ms 50
BM_TestDecompress/2/iterations:50/repeats:6 833 ms 833 ms 50
BM_TestDecompress/2/iterations:50/repeats:6 810 ms 810 ms 50
BM_TestDecompress/2/iterations:50/repeats:6 798 ms 798 ms 50
BM_TestDecompress/2/iterations:50/repeats:6_mean 807 ms 807 ms 6
BM_TestDecompress/2/iterations:50/repeats:6_median 801 ms 801 ms 6
BM_TestDecompress/2/iterations:50/repeats:6_stddev 13.3 ms 13.3 ms 6
BM_TestDecompress/2/iterations:50/repeats:6_80% 810 ms 810 ms 6
BM_TestDecompress/3/iterations:50/repeats:6 716 ms 716 ms 50
BM_TestDecompress/3/iterations:50/repeats:6 721 ms 721 ms 50
BM_TestDecompress/3/iterations:50/repeats:6 733 ms 733 ms 50
BM_TestDecompress/3/iterations:50/repeats:6 747 ms 747 ms 50
BM_TestDecompress/3/iterations:50/repeats:6 720 ms 720 ms 50
BM_TestDecompress/3/iterations:50/repeats:6 719 ms 719 ms 50
BM_TestDecompress/3/iterations:50/repeats:6_mean 726 ms 726 ms 6
BM_TestDecompress/3/iterations:50/repeats:6_median 720 ms 720 ms 6
BM_TestDecompress/3/iterations:50/repeats:6_stddev 11.6 ms 11.6 ms 6
BM_TestDecompress/3/iterations:50/repeats:6_80% 720 ms 720 ms 6
MergeTree Read Performance
! /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T10:19:21+08:00
Running /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 2.16, 1.23, 0.97
--------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------------------
BM_MergeTreeRead/2/iterations:50/repeats:6 548 ms 548 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6 521 ms 521 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6 521 ms 521 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6 522 ms 522 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6 518 ms 518 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6 513 ms 513 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6_mean 524 ms 524 ms 6
BM_MergeTreeRead/2/iterations:50/repeats:6_median 521 ms 521 ms 6
BM_MergeTreeRead/2/iterations:50/repeats:6_stddev 12.1 ms 12.1 ms 6
BM_MergeTreeRead/2/iterations:50/repeats:6_80% 518 ms 518 ms 6
TPCH Q6 performance
! /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T11:17:27+08:00
Running /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 2.03, 1.30, 0.83
-----------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------------------------
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1190 ms 1190 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1185 ms 1185 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1187 ms 1187 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1176 ms 1176 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1166 ms 1166 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1168 ms 1168 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_mean 1179 ms 1179 ms 6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_median 1180 ms 1180 ms 6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_stddev 10.2 ms 10.2 ms 6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_80% 1166 ms 1166 ms 6
Old Version
Based on commit fd29ad7: Merge pull request #26871 from kssenii/rabbit-fix-sink …
decompress performance
!/home/saber/github/ClickHouse.worktrees/origin/local_engine_with_columnar_shuffle_remove_rebase/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T12:24:01+08:00
Running /home/saber/github/ClickHouse.worktrees/origin/local_engine_with_columnar_shuffle_remove_rebase/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 0.40, 0.53, 0.76
---------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------------------
BM_TestDecompress/0/iterations:50/repeats:6 531 ms 531 ms 50
BM_TestDecompress/0/iterations:50/repeats:6 520 ms 520 ms 50
BM_TestDecompress/0/iterations:50/repeats:6 521 ms 521 ms 50
BM_TestDecompress/0/iterations:50/repeats:6 519 ms 519 ms 50
BM_TestDecompress/0/iterations:50/repeats:6 519 ms 519 ms 50
BM_TestDecompress/0/iterations:50/repeats:6 523 ms 523 ms 50
BM_TestDecompress/0/iterations:50/repeats:6_mean 522 ms 522 ms 6
BM_TestDecompress/0/iterations:50/repeats:6_median 521 ms 521 ms 6
BM_TestDecompress/0/iterations:50/repeats:6_stddev 4.37 ms 4.37 ms 6
BM_TestDecompress/0/iterations:50/repeats:6_80% 519 ms 519 ms 6
BM_TestDecompress/1/iterations:50/repeats:6 474 ms 474 ms 50
BM_TestDecompress/1/iterations:50/repeats:6 476 ms 476 ms 50
BM_TestDecompress/1/iterations:50/repeats:6 474 ms 474 ms 50
BM_TestDecompress/1/iterations:50/repeats:6 480 ms 480 ms 50
BM_TestDecompress/1/iterations:50/repeats:6 502 ms 502 ms 50
BM_TestDecompress/1/iterations:50/repeats:6 478 ms 478 ms 50
BM_TestDecompress/1/iterations:50/repeats:6_mean 481 ms 481 ms 6
BM_TestDecompress/1/iterations:50/repeats:6_median 477 ms 477 ms 6
BM_TestDecompress/1/iterations:50/repeats:6_stddev 10.7 ms 10.7 ms 6
BM_TestDecompress/1/iterations:50/repeats:6_80% 502 ms 502 ms 6
BM_TestDecompress/2/iterations:50/repeats:6 582 ms 582 ms 50
BM_TestDecompress/2/iterations:50/repeats:6 584 ms 584 ms 50
BM_TestDecompress/2/iterations:50/repeats:6 583 ms 583 ms 50
BM_TestDecompress/2/iterations:50/repeats:6 579 ms 579 ms 50
BM_TestDecompress/2/iterations:50/repeats:6 584 ms 584 ms 50
BM_TestDecompress/2/iterations:50/repeats:6 583 ms 583 ms 50
BM_TestDecompress/2/iterations:50/repeats:6_mean 583 ms 583 ms 6
BM_TestDecompress/2/iterations:50/repeats:6_median 583 ms 583 ms 6
BM_TestDecompress/2/iterations:50/repeats:6_stddev 1.63 ms 1.63 ms 6
BM_TestDecompress/2/iterations:50/repeats:6_80% 584 ms 584 ms 6
BM_TestDecompress/3/iterations:50/repeats:6 634 ms 634 ms 50
BM_TestDecompress/3/iterations:50/repeats:6 631 ms 631 ms 50
BM_TestDecompress/3/iterations:50/repeats:6 631 ms 631 ms 50
BM_TestDecompress/3/iterations:50/repeats:6 631 ms 631 ms 50
BM_TestDecompress/3/iterations:50/repeats:6 630 ms 630 ms 50
BM_TestDecompress/3/iterations:50/repeats:6 638 ms 638 ms 50
BM_TestDecompress/3/iterations:50/repeats:6_mean 632 ms 632 ms 6
BM_TestDecompress/3/iterations:50/repeats:6_median 631 ms 631 ms 6
BM_TestDecompress/3/iterations:50/repeats:6_stddev 2.84 ms 2.84 ms 6
BM_TestDecompress/3/iterations:50/repeats:6_80% 630 ms 630 ms 6
MergeTree Read Performance
!/home/saber/github/ClickHouse.worktrees/origin/local_engine_with_columnar_shuffle_remove_rebase/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T12:17:00+08:00
Running /home/saber/github/ClickHouse.worktrees/origin/local_engine_with_columnar_shuffle_remove_rebase/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 1.62, 1.22, 0.98
--------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------------------
BM_MergeTreeRead/2/iterations:50/repeats:6 436 ms 436 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6 440 ms 440 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6 437 ms 437 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6 436 ms 436 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6 441 ms 441 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6 438 ms 438 ms 50
BM_MergeTreeRead/2/iterations:50/repeats:6_mean 438 ms 438 ms 6
BM_MergeTreeRead/2/iterations:50/repeats:6_median 438 ms 438 ms 6
BM_MergeTreeRead/2/iterations:50/repeats:6_stddev 2.00 ms 2.01 ms 6
BM_MergeTreeRead/2/iterations:50/repeats:6_80% 441 ms 441 ms 6
TPCH Q6 performance
!/home/saber/github/ClickHouse.worktrees/origin/local_engine_with_columnar_shuffle_remove_rebase/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T12:09:54+08:00
Running /home/saber/github/ClickHouse.worktrees/origin/local_engine_with_columnar_shuffle_remove_rebase/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 0.14, 0.63, 0.78
-----------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------------------------
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1070 ms 1070 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1071 ms 1071 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1077 ms 1077 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1079 ms 1079 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1071 ms 1071 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6 1074 ms 1074 ms 50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_mean 1074 ms 1074 ms 6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_median 1072 ms 1072 ms 6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_stddev 3.94 ms 3.94 ms 6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_80% 1071 ms 1071 ms 6
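To put the runs side by side, the following sketch computes the relative slowdown of the unmodified 22.3 branch against the old version from the "_mean" rows reported above. The struct and function names are illustrative only. The numbers are copied from the benchmark output in this report.

```cpp
#include <cstdio>

/// Relative slowdown of `current_ms` against `baseline_ms`, in percent.
constexpr double slowdownPercent(double baseline_ms, double current_ms)
{
    return (current_ms - baseline_ms) / baseline_ms * 100.0;
}

struct Row
{
    const char * name;
    double old_ms;   /// "_mean" time of the old version (commit fd29ad7)
    double v22_3_ms; /// "_mean" time of the unmodified 22.3 branch
};

/// Mean wall times in milliseconds, copied from the tables above.
constexpr Row rows[] = {
    {"BM_TestDecompress/0", 522, 628},
    {"BM_TestDecompress/1", 481, 579},
    {"BM_TestDecompress/2", 583, 807},
    {"BM_TestDecompress/3", 632, 726},
    {"BM_MergeTreeRead/2", 438, 524},
    {"BM_MERGE_TREE_TPCH_Q6", 1074, 1179},
};

/// Prints one line per benchmark with its slowdown in percent.
inline void printSlowdowns()
{
    for (const Row & r : rows)
        std::printf("%-24s %6.1f%%\n", r.name, slowdownPercent(r.old_ms, r.v22_3_ms));
}
```

On these numbers the four decompression variants slow down by roughly 15% to 38%, the MergeTree read by about 20%, and TPCH Q6 by about 10%.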
About this issue
- State: closed
- Created 2 years ago
- Reactions: 8
- Comments: 22 (21 by maintainers)
#40142 covers the case we observed on TPCH Q6. I found one more case, which will be addressed in a subsequent PR.