hyrise: TPC-H compression does not seem to work well
I want to test the memory usage for TPC-H, so I used hyriseConsole to load the TPC-H tables:
(release)> generate_tpch 5
Generating all TPCH tables (this might take a while) ...
- Loading/Generating tables
- Loading/Generating tables done (54 s 805 ms)
- Encoding tables (if necessary) and generating pruning statistics
- Encoding 'nation' - encoding applied (375 µs 162 ns)
- Encoding 'region' - encoding applied (375 µs 872 ns)
- Encoding 'supplier' - encoding applied (689 ms 556 µs)
- Encoding 'customer' - encoding applied (1 s 644 ms)
- Encoding 'partsupp' - encoding applied (2 s 104 ms)
- Encoding 'part' - encoding applied (2 s 669 ms)
- Encoding 'orders' - encoding applied (12 s 411 ms)
- Encoding 'lineitem' - encoding applied (1 min 6 s)
- Encoding tables and generating pruning statistic done (1 min 6 s)
- Adding tables to StorageManager and generating table statistics
- Added 'nation' (952 µs 576 ns)
- Added 'region' (1 ms 118 µs)
- Added 'supplier' (132 ms 589 µs)
- Added 'customer' (2 s 34 ms)
- Added 'part' (2 s 910 ms)
- Added 'partsupp' (10 s 134 ms)
- Added 'orders' (21 s 328 ms)
- Added 'lineitem' (1 min 9 s)
- Adding tables to StorageManager and generating table statistics done (1 min 9 s)
- No indexes created as --indexes was not specified or set to false
The memory usage reported by top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25958 root 20 0 152.1g 12.6g 18432 S 0.0 2.5 29:00.69 hyriseConsole
So the raw data is about 5 GB, but the resident memory usage is 12.6 GB.
Then I changed the default encoding type from Dictionary to LZ4 in the generate_and_store function:
--- a/src/benchmarklib/abstract_table_generator.cpp
+++ b/src/benchmarklib/abstract_table_generator.cpp
@@ -207,11 +207,12 @@ void AbstractTableGenerator::generate_and_store() {
for (auto& table_info_by_name_pair : table_info_by_name) {
const auto& table_name = table_info_by_name_pair.first;
auto& table_info = table_info_by_name_pair.second;
+ auto encoding_config = EncodingConfig{SegmentEncodingSpec{EncodingType::LZ4}};
const auto encode_table = [&]() {
Timer per_table_timer;
table_info.re_encoded =
- BenchmarkTableEncoder::encode(table_name, table_info.table, _benchmark_config->en
+ BenchmarkTableEncoder::encode(table_name, table_info.table, encoding_config);
auto output = std::stringstream{};
output << "- Encoding '" + table_name << "' - "
<< (table_info.re_encoded ? "encoding applied" : "no encoding necessary") <<
The memory usage reported by top after this change:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
57645 root 20 0 153.4g 4.3g 19200 S 0.0 0.8 94:18.67 hyriseConsole
I want to reduce the memory usage as much as possible. AFAIK, many column-store databases reach high compression ratios. Is there any best practice for memory tuning?
About this issue
- State: open
- Created 3 years ago
- Comments: 24
The branch is: https://github.com/hyrise/hyrise/tree/martin/sf1000
You can execute TPC-H SF 1000 the following way:
./cmake-build-release/hyriseBenchmarkTPCH -s 1000 --scheduler --data_preparation_cores 8 --encoding simple__LPCompressionSelection_3148260871.json
The configuration is required to reduce the data set size; it is committed to the branch. The data preparation cores limit the concurrency of encoding. Otherwise, on a server with many cores, all available cores would encode the 1000 GB of data concurrently, which is too much for a 512 GB system.
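For anyone wanting to write such an encoding file themselves instead of patching the source: as far as I can tell, the `--encoding` option also accepts a JSON file with per-type and per-column overrides. A hypothetical sketch below; the key names (`default`, `type`, `custom`) and encoding names are from memory and should be verified against the EncodingConfig parser in `src/benchmarklib`:

```json
{
  "default": {"encoding": "Dictionary"},
  "type": {"string": {"encoding": "LZ4"}},
  "custom": {
    "lineitem": {
      "l_comment": {"encoding": "LZ4"}
    }
  }
}
```

This keeps low-cardinality columns dictionary-encoded (where they compress well) while applying LZ4 only where it pays off, rather than forcing LZ4 everywhere as the patch above does.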