ClickHouse: Nearly 20% decompression performance degradation

I'm working on the ClickHouse-based backend for Project Gluten and recently saw a significant performance degradation in the newer version of ClickHouse after merging the 22.3 branch.

Performance Test of LZ4 Decompression

There are three experimental scenarios, all of which use the same data set: the LineItem table of TPC-H at scale factor 10. Columns:

  • l_quantity
  • l_extendedprice
  • l_discount
  • l_shipdate

Row count: 60,000,000

The experimental scenarios:
1. Decompression performance test: using the implementation in utils/compressor, LZ4 decompression variants 0, 1, 2, and 3 are tested.
2. MergeTree read test: build a plan in code that contains only ReadFromMergeTree.
3. TPCH Q6 test: build the QueryPlan of TPCH Q6 in code and run it.

Summary First

Version 22.3 is more than 10% slower than an earlier version such as 21.4, and the tests below show that the degradation comes from the LZ4 decompression code. Reading the code shows that the new changes were made to address security issues. The added logic is not complex, yet it causes a significant slowdown.

22.3 with bound check removed

code of decompressImpl:

template <size_t copy_amount, bool use_shuffle>
bool NO_INLINE decompressImpl(
     const char * const source,
     char * const dest,
     size_t source_size,
     size_t dest_size)
{
    const UInt8 * ip = reinterpret_cast<const UInt8 *>(source);
    UInt8 * op = reinterpret_cast<UInt8 *>(dest);
    const UInt8 * const input_end = ip + source_size;
    UInt8 * const output_begin = op;
    UInt8 * const output_end = op + dest_size;

    /// Unrolling with clang is doing >10% performance degrade.
#if defined(__clang__)
    #pragma nounroll
#endif
    while (true)
    {
        size_t length;

        auto continue_read_length = [&]
        {
            unsigned s;
            do
            {
                s = *ip++;
                length += s;
            } while (unlikely(s == 255 && ip < input_end));
        };

        /// Get literal length.

        const unsigned token = *ip++;
        length = token >> 4;
        if (length == 0x0F)
            continue_read_length();

        /// Copy literals.

        UInt8 * copy_end = op + length;

        /// input: Hello, world
        ///        ^-ip
        /// output: xyz
        ///            ^-op  ^-copy_end
        /// output: xyzHello, w
        ///                   ^- excessive copied bytes due to "wildCopy"
        /// input: Hello, world
        ///              ^-ip
        /// output: xyzHello, w
        ///                  ^-op (we will overwrite excessive bytes on next iteration)

        if (unlikely(copy_end > output_end))
            return false;

        // Due to implementation specifics the copy length is always a multiple of copy_amount
        size_t real_length = 0;

        static_assert(copy_amount == 8 || copy_amount == 16 || copy_amount == 32);
        if constexpr (copy_amount == 8)
            real_length = (((length >> 3) + 1) * 8);
        else if constexpr (copy_amount == 16)
            real_length = (((length >> 4) + 1) * 16);
        else if constexpr (copy_amount == 32)
            real_length = (((length >> 5) + 1) * 32);

        if (unlikely(ip + real_length >= input_end + ADDITIONAL_BYTES_AT_END_OF_BUFFER))
             return false;

        wildCopy<copy_amount>(op, ip, copy_end);    /// Here we can write up to copy_amount - 1 bytes after buffer.

        if (copy_end == output_end)
            return true;

        ip += length;
        op = copy_end;

        if (copy_end >= output_end)
            return true;

        /// Get match offset.

        size_t offset = unalignedLoad<UInt16>(ip);
        ip += 2;
        const UInt8 * match = op - offset;

        /// Get match length.

        length = token & 0x0F;
        if (length == 0x0F)
            continue_read_length();
        length += 4;

        /// Copy match within block, that produce overlapping pattern. Match may replicate itself.

        copy_end = op + length;

        /** Here we can write up to copy_amount - 1 - 4 * 2 bytes after buffer.
          * The worst case when offset = 1 and length = 4
          */

        if (unlikely(offset < copy_amount))
        {
            /// output: Hello
            ///              ^-op
            ///         ^-match; offset = 5
            ///
            /// output: Hello
            ///         [------] - copy_amount bytes
            ///              [------] - copy them here
            ///
            /// output: HelloHelloHel
            ///            ^-match   ^-op

            copyOverlap<copy_amount, use_shuffle>(op, match, offset);
        }
        else
        {
            copy<copy_amount>(op, match);
            match += copy_amount;
        }

        op += copy_amount;

        copy<copy_amount>(op, match);   /// copy_amount + copy_amount - 1 - 4 * 2 bytes after buffer.
        if (length > copy_amount * 2)
            wildCopy<copy_amount>(op + copy_amount, match + copy_amount, copy_end);

        op = copy_end;
    }
}

}
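
The real_length computation in the listing above takes length / copy_amount + 1 whole chunks, so the value is always a multiple of copy_amount and strictly greater than length. The following is a minimal standalone sketch of that arithmetic only. The helper name paddedCopyLength is hypothetical and does not exist in ClickHouse.

```cpp
#include <cstddef>

// Hypothetical helper (not part of ClickHouse) mirroring the real_length
// computation from the listing: length / copy_amount + 1 whole chunks.
template <std::size_t copy_amount>
constexpr std::size_t paddedCopyLength(std::size_t length)
{
    static_assert(copy_amount == 8 || copy_amount == 16 || copy_amount == 32,
                  "only the chunk sizes used in the listing are supported");
    return (length / copy_amount + 1) * copy_amount;
}

// The result is a multiple of copy_amount and strictly greater than length,
// so an exact multiple is still padded by one full chunk.
static_assert(paddedCopyLength<8>(0) == 8, "empty literal run still counts one chunk");
static_assert(paddedCopyLength<8>(7) == 8, "7 bytes fit into one 8-byte chunk");
static_assert(paddedCopyLength<8>(8) == 16, "exact multiple gets one extra chunk");
static_assert(paddedCopyLength<16>(20) == 32, "20 bytes need two 16-byte chunks");
static_assert(paddedCopyLength<32>(20) == 32, "20 bytes fit into one 32-byte chunk");
```

This value is what the listing compares against input_end + ADDITIONAL_BYTES_AT_END_OF_BUFFER before calling wildCopy.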

decompress performance

! /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T11:52:16+08:00
Running /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 1.20, 0.75, 0.67
---------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations
---------------------------------------------------------------------------------------------
BM_TestDecompress/0/iterations:50/repeats:6               551 ms          551 ms           50
BM_TestDecompress/0/iterations:50/repeats:6               549 ms          549 ms           50
BM_TestDecompress/0/iterations:50/repeats:6               549 ms          549 ms           50
BM_TestDecompress/0/iterations:50/repeats:6               552 ms          552 ms           50
BM_TestDecompress/0/iterations:50/repeats:6               553 ms          553 ms           50
BM_TestDecompress/0/iterations:50/repeats:6               547 ms          547 ms           50
BM_TestDecompress/0/iterations:50/repeats:6_mean          550 ms          550 ms            6
BM_TestDecompress/0/iterations:50/repeats:6_median        550 ms          550 ms            6
BM_TestDecompress/0/iterations:50/repeats:6_stddev       2.14 ms         2.14 ms            6
BM_TestDecompress/0/iterations:50/repeats:6_80%           553 ms          553 ms            6
BM_TestDecompress/1/iterations:50/repeats:6               667 ms          667 ms           50
BM_TestDecompress/1/iterations:50/repeats:6               694 ms          694 ms           50
BM_TestDecompress/1/iterations:50/repeats:6               668 ms          668 ms           50
BM_TestDecompress/1/iterations:50/repeats:6               670 ms          670 ms           50
BM_TestDecompress/1/iterations:50/repeats:6               677 ms          677 ms           50
BM_TestDecompress/1/iterations:50/repeats:6               670 ms          670 ms           50
BM_TestDecompress/1/iterations:50/repeats:6_mean          674 ms          674 ms            6
BM_TestDecompress/1/iterations:50/repeats:6_median        670 ms          670 ms            6
BM_TestDecompress/1/iterations:50/repeats:6_stddev       10.4 ms         10.4 ms            6
BM_TestDecompress/1/iterations:50/repeats:6_80%           677 ms          677 ms            6
BM_TestDecompress/2/iterations:50/repeats:6               680 ms          680 ms           50
BM_TestDecompress/2/iterations:50/repeats:6               693 ms          693 ms           50
BM_TestDecompress/2/iterations:50/repeats:6               731 ms          731 ms           50
BM_TestDecompress/2/iterations:50/repeats:6               699 ms          699 ms           50
BM_TestDecompress/2/iterations:50/repeats:6               688 ms          688 ms           50
BM_TestDecompress/2/iterations:50/repeats:6               684 ms          684 ms           50
BM_TestDecompress/2/iterations:50/repeats:6_mean          696 ms          696 ms            6
BM_TestDecompress/2/iterations:50/repeats:6_median        691 ms          691 ms            6
BM_TestDecompress/2/iterations:50/repeats:6_stddev       18.7 ms         18.7 ms            6
BM_TestDecompress/2/iterations:50/repeats:6_80%           688 ms          688 ms            6
BM_TestDecompress/3/iterations:50/repeats:6               813 ms          813 ms           50
BM_TestDecompress/3/iterations:50/repeats:6               817 ms          817 ms           50
BM_TestDecompress/3/iterations:50/repeats:6               813 ms          813 ms           50
BM_TestDecompress/3/iterations:50/repeats:6               815 ms          815 ms           50
BM_TestDecompress/3/iterations:50/repeats:6               816 ms          816 ms           50
BM_TestDecompress/3/iterations:50/repeats:6               816 ms          816 ms           50
BM_TestDecompress/3/iterations:50/repeats:6_mean          815 ms          815 ms            6
BM_TestDecompress/3/iterations:50/repeats:6_median        815 ms          815 ms            6
BM_TestDecompress/3/iterations:50/repeats:6_stddev       1.55 ms         1.55 ms            6
BM_TestDecompress/3/iterations:50/repeats:6_80%           816 ms          816 ms            6

MergeTree Read Performance

! /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T11:36:00+08:00
Running /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 0.17, 0.70, 0.86
--------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations
--------------------------------------------------------------------------------------------
BM_MergeTreeRead/2/iterations:50/repeats:6               499 ms          499 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6               499 ms          499 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6               505 ms          505 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6               498 ms          498 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6               500 ms          500 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6               498 ms          498 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6_mean          500 ms          500 ms            6
BM_MergeTreeRead/2/iterations:50/repeats:6_median        499 ms          499 ms            6
BM_MergeTreeRead/2/iterations:50/repeats:6_stddev       2.68 ms         2.68 ms            6
BM_MergeTreeRead/2/iterations:50/repeats:6_80%           500 ms          500 ms            6

TPCH Q6 performance

! /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T11:27:40+08:00
Running /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 1.99, 1.15, 0.94
-----------------------------------------------------------------------------------------------
Benchmark                                                     Time             CPU   Iterations
-----------------------------------------------------------------------------------------------
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1149 ms         1149 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1144 ms         1144 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1143 ms         1143 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1140 ms         1140 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1142 ms         1142 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1136 ms         1136 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_mean         1142 ms         1142 ms            6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_median       1143 ms         1143 ms            6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_stddev       4.34 ms         4.34 ms            6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_80%          1142 ms         1142 ms            6

22.3 with all changes reverted

code of decompressImpl:

template <size_t copy_amount, bool use_shuffle>
bool NO_INLINE decompressImpl(
     const char * const source,
     char * const dest,
     size_t source_size,
     size_t dest_size)
{
    const UInt8 * ip = reinterpret_cast<const UInt8 *>(source);
    UInt8 * op = reinterpret_cast<UInt8 *>(dest);
    const UInt8 * const input_end = ip + source_size;
    UInt8 * const output_begin = op;
    UInt8 * const output_end = op + dest_size;

    /// Unrolling with clang is doing >10% performance degrade.
#if defined(__clang__)
    #pragma nounroll
#endif
    while (true)
    {
        size_t length;

        auto continue_read_length = [&]
        {
            unsigned s;
            do
            {
                s = *ip++;
                length += s;
            } while (unlikely(s == 255));
        };

        /// Get literal length.

        const unsigned token = *ip++;
        length = token >> 4;
        if (length == 0x0F)
            continue_read_length();

        /// Copy literals.

        UInt8 * copy_end = op + length;

        /// input: Hello, world
        ///        ^-ip
        /// output: xyz
        ///            ^-op  ^-copy_end
        /// output: xyzHello, w
        ///                   ^- excessive copied bytes due to "wildCopy"
        /// input: Hello, world
        ///              ^-ip
        /// output: xyzHello, w
        ///                  ^-op (we will overwrite excessive bytes on next iteration)

        wildCopy<copy_amount>(op, ip, copy_end);    /// Here we can write up to copy_amount - 1 bytes after buffer.

        ip += length;
        op = copy_end;

        if (copy_end >= output_end)
            return true;

        /// Get match offset.

        size_t offset = unalignedLoad<UInt16>(ip);
        ip += 2;
        const UInt8 * match = op - offset;

        /// Get match length.

        length = token & 0x0F;
        if (length == 0x0F)
            continue_read_length();
        length += 4;

        /// Copy match within block, that produce overlapping pattern. Match may replicate itself.

        copy_end = op + length;

        /** Here we can write up to copy_amount - 1 - 4 * 2 bytes after buffer.
          * The worst case when offset = 1 and length = 4
          */

        if (unlikely(offset < copy_amount))
        {
            /// output: Hello
            ///              ^-op
            ///         ^-match; offset = 5
            ///
            /// output: Hello
            ///         [------] - copy_amount bytes
            ///              [------] - copy them here
            ///
            /// output: HelloHelloHel
            ///            ^-match   ^-op

            copyOverlap<copy_amount, use_shuffle>(op, match, offset);
        }
        else
        {
            copy<copy_amount>(op, match);
            match += copy_amount;
        }

        op += copy_amount;

        copy<copy_amount>(op, match);   /// copy_amount + copy_amount - 1 - 4 * 2 bytes after buffer.
        if (length > copy_amount * 2)
            wildCopy<copy_amount>(op + copy_amount, match + copy_amount, copy_end);

        op = copy_end;
    }
}

}
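
One visible difference between the two listings is the loop condition in continue_read_length: the first listing stops at input_end (s == 255 && ip < input_end), while the reverted listing only tests s == 255. The sketch below isolates that loop so both variants can be compared side by side. It is an illustration under assumptions, not the ClickHouse code: the function readLength, its check_bounds template parameter, and the sample array are hypothetical, and the unlikely() hint is omitted.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical standalone version of the continue_read_length lambda.
// `initial` is the 4-bit length taken from the token; 0x0F means that
// additional length bytes follow, each added until a byte below 255 is read.
template <bool check_bounds>
constexpr std::size_t readLength(const std::uint8_t * ip, const std::uint8_t * input_end, std::size_t initial)
{
    std::size_t length = initial;
    if (length != 0x0F)
        return length;

    unsigned s = 0;
    do
    {
        s = *ip++;
        length += s;
    } while (s == 255 && (!check_bounds || ip < input_end));

    return length;
}

// Illustrative input: two continuation bytes followed by a terminating byte.
constexpr std::uint8_t sample[] = {255, 255, 10};

// On well-formed input both variants agree: 15 + 255 + 255 + 10.
static_assert(readLength<false>(sample, sample + 3, 0x0F) == 535, "unchecked variant");
static_assert(readLength<true>(sample, sample + 3, 0x0F) == 535, "checked variant");

// On truncated input the checked variant stops at input_end: 15 + 255 + 255.
static_assert(readLength<true>(sample, sample + 2, 0x0F) == 525, "stops at input_end");
```

The unchecked variant is deliberately not evaluated on the truncated input, because it would read past the end of the buffer there. That extra comparison per length byte is one of the checks whose cost the benchmarks in this post measure together with the others.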

decompress performance

! /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T13:08:07+08:00
Running /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 0.16, 0.36, 0.61
---------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations
---------------------------------------------------------------------------------------------
BM_TestDecompress/0/iterations:50/repeats:6               518 ms          518 ms           50
BM_TestDecompress/0/iterations:50/repeats:6               515 ms          515 ms           50
BM_TestDecompress/0/iterations:50/repeats:6               516 ms          516 ms           50
BM_TestDecompress/0/iterations:50/repeats:6               520 ms          520 ms           50
BM_TestDecompress/0/iterations:50/repeats:6               524 ms          524 ms           50
BM_TestDecompress/0/iterations:50/repeats:6               516 ms          516 ms           50
BM_TestDecompress/0/iterations:50/repeats:6_mean          518 ms          518 ms            6
BM_TestDecompress/0/iterations:50/repeats:6_median        517 ms          517 ms            6
BM_TestDecompress/0/iterations:50/repeats:6_stddev       3.38 ms         3.38 ms            6
BM_TestDecompress/0/iterations:50/repeats:6_80%           524 ms          524 ms            6
BM_TestDecompress/1/iterations:50/repeats:6               476 ms          476 ms           50
BM_TestDecompress/1/iterations:50/repeats:6               476 ms          476 ms           50
BM_TestDecompress/1/iterations:50/repeats:6               477 ms          477 ms           50
BM_TestDecompress/1/iterations:50/repeats:6               480 ms          480 ms           50
BM_TestDecompress/1/iterations:50/repeats:6               480 ms          480 ms           50
BM_TestDecompress/1/iterations:50/repeats:6               478 ms          478 ms           50
BM_TestDecompress/1/iterations:50/repeats:6_mean          478 ms          478 ms            6
BM_TestDecompress/1/iterations:50/repeats:6_median        477 ms          477 ms            6
BM_TestDecompress/1/iterations:50/repeats:6_stddev       1.81 ms         1.81 ms            6
BM_TestDecompress/1/iterations:50/repeats:6_80%           480 ms          480 ms            6
BM_TestDecompress/2/iterations:50/repeats:6               589 ms          589 ms           50
BM_TestDecompress/2/iterations:50/repeats:6               584 ms          584 ms           50
BM_TestDecompress/2/iterations:50/repeats:6               581 ms          581 ms           50
BM_TestDecompress/2/iterations:50/repeats:6               586 ms          586 ms           50
BM_TestDecompress/2/iterations:50/repeats:6               581 ms          581 ms           50
BM_TestDecompress/2/iterations:50/repeats:6               584 ms          584 ms           50
BM_TestDecompress/2/iterations:50/repeats:6_mean          584 ms          584 ms            6
BM_TestDecompress/2/iterations:50/repeats:6_median        584 ms          584 ms            6
BM_TestDecompress/2/iterations:50/repeats:6_stddev       2.83 ms         2.83 ms            6
BM_TestDecompress/2/iterations:50/repeats:6_80%           581 ms          581 ms            6
BM_TestDecompress/3/iterations:50/repeats:6               636 ms          636 ms           50
BM_TestDecompress/3/iterations:50/repeats:6               635 ms          635 ms           50
BM_TestDecompress/3/iterations:50/repeats:6               634 ms          634 ms           50
BM_TestDecompress/3/iterations:50/repeats:6               636 ms          636 ms           50
BM_TestDecompress/3/iterations:50/repeats:6               635 ms          635 ms           50
BM_TestDecompress/3/iterations:50/repeats:6               632 ms          632 ms           50
BM_TestDecompress/3/iterations:50/repeats:6_mean          635 ms          635 ms            6
BM_TestDecompress/3/iterations:50/repeats:6_median        635 ms          635 ms            6
BM_TestDecompress/3/iterations:50/repeats:6_stddev       1.37 ms         1.37 ms            6
BM_TestDecompress/3/iterations:50/repeats:6_80%           635 ms          635 ms            6

MergeTree Read Performance

! /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T12:58:30+08:00
Running /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 1.91, 1.24, 0.93
--------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations
--------------------------------------------------------------------------------------------
BM_MergeTreeRead/2/iterations:50/repeats:6               445 ms          445 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6               447 ms          447 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6               460 ms          460 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6               462 ms          462 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6               452 ms          452 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6               461 ms          461 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6_mean          454 ms          454 ms            6
BM_MergeTreeRead/2/iterations:50/repeats:6_median        456 ms          456 ms            6
BM_MergeTreeRead/2/iterations:50/repeats:6_stddev       7.57 ms         7.57 ms            6
BM_MergeTreeRead/2/iterations:50/repeats:6_80%           452 ms          452 ms            6

TPCH Q6 performance

! /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T13:47:10+08:00
Running /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 1.54, 1.00, 0.76
-----------------------------------------------------------------------------------------------
Benchmark                                                     Time             CPU   Iterations
-----------------------------------------------------------------------------------------------
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1093 ms         1093 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1078 ms         1078 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1098 ms         1098 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1079 ms         1079 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1069 ms         1069 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1113 ms         1113 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_mean         1088 ms         1088 ms            6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_median       1086 ms         1086 ms            6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_stddev       16.0 ms         16.0 ms            6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_80%          1069 ms         1069 ms            6

22.3 no changes

The unmodified 22.3 branch.

decompress performance

! /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T11:01:35+08:00
Running /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 0.38, 0.19, 0.13
---------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations
---------------------------------------------------------------------------------------------
BM_TestDecompress/0/iterations:50/repeats:6               630 ms          627 ms           50
BM_TestDecompress/0/iterations:50/repeats:6               626 ms          626 ms           50
BM_TestDecompress/0/iterations:50/repeats:6               626 ms          626 ms           50
BM_TestDecompress/0/iterations:50/repeats:6               623 ms          623 ms           50
BM_TestDecompress/0/iterations:50/repeats:6               633 ms          633 ms           50
BM_TestDecompress/0/iterations:50/repeats:6               628 ms          628 ms           50
BM_TestDecompress/0/iterations:50/repeats:6_mean          628 ms          627 ms            6
BM_TestDecompress/0/iterations:50/repeats:6_median        627 ms          627 ms            6
BM_TestDecompress/0/iterations:50/repeats:6_stddev       3.32 ms         3.17 ms            6
BM_TestDecompress/0/iterations:50/repeats:6_80%           633 ms          633 ms            6
BM_TestDecompress/1/iterations:50/repeats:6               580 ms          580 ms           50
BM_TestDecompress/1/iterations:50/repeats:6               577 ms          577 ms           50
BM_TestDecompress/1/iterations:50/repeats:6               581 ms          581 ms           50
BM_TestDecompress/1/iterations:50/repeats:6               578 ms          578 ms           50
BM_TestDecompress/1/iterations:50/repeats:6               579 ms          579 ms           50
BM_TestDecompress/1/iterations:50/repeats:6               578 ms          578 ms           50
BM_TestDecompress/1/iterations:50/repeats:6_mean          579 ms          579 ms            6
BM_TestDecompress/1/iterations:50/repeats:6_median        579 ms          579 ms            6
BM_TestDecompress/1/iterations:50/repeats:6_stddev       1.59 ms         1.59 ms            6
BM_TestDecompress/1/iterations:50/repeats:6_80%           579 ms          579 ms            6
BM_TestDecompress/2/iterations:50/repeats:6               800 ms          800 ms           50
BM_TestDecompress/2/iterations:50/repeats:6               802 ms          802 ms           50
BM_TestDecompress/2/iterations:50/repeats:6               801 ms          801 ms           50
BM_TestDecompress/2/iterations:50/repeats:6               833 ms          833 ms           50
BM_TestDecompress/2/iterations:50/repeats:6               810 ms          810 ms           50
BM_TestDecompress/2/iterations:50/repeats:6               798 ms          798 ms           50
BM_TestDecompress/2/iterations:50/repeats:6_mean          807 ms          807 ms            6
BM_TestDecompress/2/iterations:50/repeats:6_median        801 ms          801 ms            6
BM_TestDecompress/2/iterations:50/repeats:6_stddev       13.3 ms         13.3 ms            6
BM_TestDecompress/2/iterations:50/repeats:6_80%           810 ms          810 ms            6
BM_TestDecompress/3/iterations:50/repeats:6               716 ms          716 ms           50
BM_TestDecompress/3/iterations:50/repeats:6               721 ms          721 ms           50
BM_TestDecompress/3/iterations:50/repeats:6               733 ms          733 ms           50
BM_TestDecompress/3/iterations:50/repeats:6               747 ms          747 ms           50
BM_TestDecompress/3/iterations:50/repeats:6               720 ms          720 ms           50
BM_TestDecompress/3/iterations:50/repeats:6               719 ms          719 ms           50
BM_TestDecompress/3/iterations:50/repeats:6_mean          726 ms          726 ms            6
BM_TestDecompress/3/iterations:50/repeats:6_median        720 ms          720 ms            6
BM_TestDecompress/3/iterations:50/repeats:6_stddev       11.6 ms         11.6 ms            6
BM_TestDecompress/3/iterations:50/repeats:6_80%           720 ms          720 ms            6

MergeTree Read Performance

! /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T10:19:21+08:00
Running /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 2.16, 1.23, 0.97
--------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations
--------------------------------------------------------------------------------------------
BM_MergeTreeRead/2/iterations:50/repeats:6               548 ms          548 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6               521 ms          521 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6               521 ms          521 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6               522 ms          522 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6               518 ms          518 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6               513 ms          513 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6_mean          524 ms          524 ms            6
BM_MergeTreeRead/2/iterations:50/repeats:6_median        521 ms          521 ms            6
BM_MergeTreeRead/2/iterations:50/repeats:6_stddev       12.1 ms         12.1 ms            6
BM_MergeTreeRead/2/iterations:50/repeats:6_80%           518 ms          518 ms            6

TPCH Q6 performance

! /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T11:17:27+08:00
Running /home/saber/github/ClickHouse/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 2.03, 1.30, 0.83
-----------------------------------------------------------------------------------------------
Benchmark                                                     Time             CPU   Iterations
-----------------------------------------------------------------------------------------------
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1190 ms         1190 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1185 ms         1185 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1187 ms         1187 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1176 ms         1176 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1166 ms         1166 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1168 ms         1168 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_mean         1179 ms         1179 ms            6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_median       1180 ms         1180 ms            6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_stddev       10.2 ms         10.2 ms            6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_80%          1166 ms         1166 ms            6
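
To put the numbers side by side, the sketch below computes how much slower the unmodified 22.3 build is than the build with all changes reverted, using the _mean rows from the two runs reported above. The helper slowdownPercent, the struct MeanPair, and the function printSlowdowns are hypothetical names introduced only for this illustration.

```cpp
#include <cstdio>

// Hypothetical helper: how much slower current_ms is than baseline_ms, in percent.
constexpr double slowdownPercent(double baseline_ms, double current_ms)
{
    return (current_ms - baseline_ms) / baseline_ms * 100.0;
}

struct MeanPair
{
    const char * name;
    double reverted_ms;    // mean with all changes reverted
    double unmodified_ms;  // mean on the unmodified 22.3 branch
};

// Mean times copied from the benchmark output of the two runs above.
constexpr MeanPair means[] = {
    {"BM_TestDecompress/0", 518, 628},
    {"BM_TestDecompress/1", 478, 579},
    {"BM_TestDecompress/2", 584, 807},
    {"BM_TestDecompress/3", 635, 726},
    {"BM_MergeTreeRead/2", 454, 524},
    {"BM_MERGE_TREE_TPCH_Q6", 1088, 1179},
};

// Prints one line per benchmark with the relative slowdown.
inline void printSlowdowns()
{
    for (const MeanPair & m : means)
        std::printf("%-24s %5.1f%%\n", m.name, slowdownPercent(m.reverted_ms, m.unmodified_ms));
}
```

Calling printSlowdowns() from main gives roughly 21%, 21%, 38%, and 14% for the four decompression variants, about 15% for the MergeTree read, and about 8% for TPCH Q6. The slowdown shrinks as more non-decompression work is added to the measured path.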

Old Version

Based on commit fd29ad7: Merge pull request #26871 from kssenii/rabbit-fix-sink …

decompress performance

!/home/saber/github/ClickHouse.worktrees/origin/local_engine_with_columnar_shuffle_remove_rebase/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T12:24:01+08:00
Running /home/saber/github/ClickHouse.worktrees/origin/local_engine_with_columnar_shuffle_remove_rebase/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 0.40, 0.53, 0.76
---------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations
---------------------------------------------------------------------------------------------
BM_TestDecompress/0/iterations:50/repeats:6               531 ms          531 ms           50
BM_TestDecompress/0/iterations:50/repeats:6               520 ms          520 ms           50
BM_TestDecompress/0/iterations:50/repeats:6               521 ms          521 ms           50
BM_TestDecompress/0/iterations:50/repeats:6               519 ms          519 ms           50
BM_TestDecompress/0/iterations:50/repeats:6               519 ms          519 ms           50
BM_TestDecompress/0/iterations:50/repeats:6               523 ms          523 ms           50
BM_TestDecompress/0/iterations:50/repeats:6_mean          522 ms          522 ms            6
BM_TestDecompress/0/iterations:50/repeats:6_median        521 ms          521 ms            6
BM_TestDecompress/0/iterations:50/repeats:6_stddev       4.37 ms         4.37 ms            6
BM_TestDecompress/0/iterations:50/repeats:6_80%           519 ms          519 ms            6
BM_TestDecompress/1/iterations:50/repeats:6               474 ms          474 ms           50
BM_TestDecompress/1/iterations:50/repeats:6               476 ms          476 ms           50
BM_TestDecompress/1/iterations:50/repeats:6               474 ms          474 ms           50
BM_TestDecompress/1/iterations:50/repeats:6               480 ms          480 ms           50
BM_TestDecompress/1/iterations:50/repeats:6               502 ms          502 ms           50
BM_TestDecompress/1/iterations:50/repeats:6               478 ms          478 ms           50
BM_TestDecompress/1/iterations:50/repeats:6_mean          481 ms          481 ms            6
BM_TestDecompress/1/iterations:50/repeats:6_median        477 ms          477 ms            6
BM_TestDecompress/1/iterations:50/repeats:6_stddev       10.7 ms         10.7 ms            6
BM_TestDecompress/1/iterations:50/repeats:6_80%           502 ms          502 ms            6
BM_TestDecompress/2/iterations:50/repeats:6               582 ms          582 ms           50
BM_TestDecompress/2/iterations:50/repeats:6               584 ms          584 ms           50
BM_TestDecompress/2/iterations:50/repeats:6               583 ms          583 ms           50
BM_TestDecompress/2/iterations:50/repeats:6               579 ms          579 ms           50
BM_TestDecompress/2/iterations:50/repeats:6               584 ms          584 ms           50
BM_TestDecompress/2/iterations:50/repeats:6               583 ms          583 ms           50
BM_TestDecompress/2/iterations:50/repeats:6_mean          583 ms          583 ms            6
BM_TestDecompress/2/iterations:50/repeats:6_median        583 ms          583 ms            6
BM_TestDecompress/2/iterations:50/repeats:6_stddev       1.63 ms         1.63 ms            6
BM_TestDecompress/2/iterations:50/repeats:6_80%           584 ms          584 ms            6
BM_TestDecompress/3/iterations:50/repeats:6               634 ms          634 ms           50
BM_TestDecompress/3/iterations:50/repeats:6               631 ms          631 ms           50
BM_TestDecompress/3/iterations:50/repeats:6               631 ms          631 ms           50
BM_TestDecompress/3/iterations:50/repeats:6               631 ms          631 ms           50
BM_TestDecompress/3/iterations:50/repeats:6               630 ms          630 ms           50
BM_TestDecompress/3/iterations:50/repeats:6               638 ms          638 ms           50
BM_TestDecompress/3/iterations:50/repeats:6_mean          632 ms          632 ms            6
BM_TestDecompress/3/iterations:50/repeats:6_median        631 ms          631 ms            6
BM_TestDecompress/3/iterations:50/repeats:6_stddev       2.84 ms         2.84 ms            6
BM_TestDecompress/3/iterations:50/repeats:6_80%           630 ms          630 ms            6

MergeTree Read Performance

!/home/saber/github/ClickHouse.worktrees/origin/local_engine_with_columnar_shuffle_remove_rebase/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T12:17:00+08:00
Running /home/saber/github/ClickHouse.worktrees/origin/local_engine_with_columnar_shuffle_remove_rebase/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 1.62, 1.22, 0.98
--------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations
--------------------------------------------------------------------------------------------
BM_MergeTreeRead/2/iterations:50/repeats:6               436 ms          436 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6               440 ms          440 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6               437 ms          437 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6               436 ms          436 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6               441 ms          441 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6               438 ms          438 ms           50
BM_MergeTreeRead/2/iterations:50/repeats:6_mean          438 ms          438 ms            6
BM_MergeTreeRead/2/iterations:50/repeats:6_median        438 ms          438 ms            6
BM_MergeTreeRead/2/iterations:50/repeats:6_stddev       2.00 ms         2.01 ms            6
BM_MergeTreeRead/2/iterations:50/repeats:6_80%           441 ms          441 ms            6

TPCH Q6 performance

!/home/saber/github/ClickHouse.worktrees/origin/local_engine_with_columnar_shuffle_remove_rebase/build/utils/local-engine/tests/benchmark_local_engine
2022-04-19T12:09:54+08:00
Running /home/saber/github/ClickHouse.worktrees/origin/local_engine_with_columnar_shuffle_remove_rebase/build/utils/local-engine/tests/benchmark_local_engine
Run on (12 X 3696 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 0.14, 0.63, 0.78
-----------------------------------------------------------------------------------------------
Benchmark                                                     Time             CPU   Iterations
-----------------------------------------------------------------------------------------------
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1070 ms         1070 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1071 ms         1071 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1077 ms         1077 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1079 ms         1079 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1071 ms         1071 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6              1074 ms         1074 ms           50
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_mean         1074 ms         1074 ms            6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_median       1072 ms         1072 ms            6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_stddev       3.94 ms         3.94 ms            6
BM_MERGE_TREE_TPCH_Q6/iterations:50/repeats:6_80%          1071 ms         1071 ms            6
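
To put the two TPCH Q6 tables side by side, the relative slowdown can be computed from the reported means: 1179 ms in the Q6 table that precedes the "Old Version" heading (presumably the 22.3-based build) and 1074 ms here. The snippet below is a small sketch of that arithmetic; the function and constant names are made up for illustration.

```cpp
#include <cstdio>

// Relative slowdown of new_ms versus old_ms, in percent.
double slowdown_percent(double old_ms, double new_ms)
{
    return (new_ms - old_ms) / old_ms * 100.0;
}

// Means of BM_MERGE_TREE_TPCH_Q6 copied from the benchmark output.
constexpr double q6_mean_new_ms = 1179.0; // table before the "Old Version" heading
constexpr double q6_mean_old_ms = 1074.0; // old version, commit fd29ad7

// Prints the Q6 slowdown rounded to one decimal place.
void print_q6_slowdown()
{
    std::printf("TPCH Q6 slowdown: %.1f%%\n", slowdown_percent(q6_mean_old_ms, q6_mean_new_ms));
}
```

Calling `print_q6_slowdown()` from `main` reports a slowdown of about 9.8% for Q6, i.e. (1179 - 1074) / 1074.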

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 8
  • Comments: 22 (21 by maintainers)

Most upvoted comments

#40142 covers the case we observed on TPCH Q6. I found one more case, which will be addressed in a subsequent PR.