hyrise: TPC-H compression does not seem to work well
I want to test the memory usage for TPC-H, so I used hyriseConsole to load the TPC-H tables:
(release)> generate_tpch 5
Generating all TPCH tables (this might take a while) ...
- Loading/Generating tables
- Loading/Generating tables done (54 s 805 ms)
- Encoding tables (if necessary) and generating pruning statistics
- Encoding 'nation' - encoding applied (375 µs 162 ns)
- Encoding 'region' - encoding applied (375 µs 872 ns)
- Encoding 'supplier' - encoding applied (689 ms 556 µs)
- Encoding 'customer' - encoding applied (1 s 644 ms)
- Encoding 'partsupp' - encoding applied (2 s 104 ms)
- Encoding 'part' - encoding applied (2 s 669 ms)
- Encoding 'orders' - encoding applied (12 s 411 ms)
- Encoding 'lineitem' - encoding applied (1 min 6 s)
- Encoding tables and generating pruning statistic done (1 min 6 s)
- Adding tables to StorageManager and generating table statistics
- Added 'nation' (952 µs 576 ns)
- Added 'region' (1 ms 118 µs)
- Added 'supplier' (132 ms 589 µs)
- Added 'customer' (2 s 34 ms)
- Added 'part' (2 s 910 ms)
- Added 'partsupp' (10 s 134 ms)
- Added 'orders' (21 s 328 ms)
- Added 'lineitem' (1 min 9 s)
- Adding tables to StorageManager and generating table statistics done (1 min 9 s)
- No indexes created as --indexes was not specified or set to false
The memory usage reported by top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25958 root 20 0 152.1g 12.6g 18432 S 0.0 2.5 29:00.69 hyriseConsole
So the raw data is about 5 GB, but the resident memory usage is 12.6 GB.
Then I changed the default encoding type from Dictionary to LZ4 in the generate_and_store function:
--- a/src/benchmarklib/abstract_table_generator.cpp
+++ b/src/benchmarklib/abstract_table_generator.cpp
@@ -207,11 +207,12 @@ void AbstractTableGenerator::generate_and_store() {
for (auto& table_info_by_name_pair : table_info_by_name) {
const auto& table_name = table_info_by_name_pair.first;
auto& table_info = table_info_by_name_pair.second;
+ auto encoding_config = EncodingConfig{SegmentEncodingSpec{EncodingType::LZ4}};
const auto encode_table = [&]() {
Timer per_table_timer;
table_info.re_encoded =
- BenchmarkTableEncoder::encode(table_name, table_info.table, _benchmark_config->en
+ BenchmarkTableEncoder::encode(table_name, table_info.table, encoding_config);
auto output = std::stringstream{};
output << "- Encoding '" + table_name << "' - "
<< (table_info.re_encoded ? "encoding applied" : "no encoding necessary") <<
The memory usage reported by top after this change:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
57645 root 20 0 153.4g 4.3g 19200 S 0.0 0.8 94:18.67 hyriseConsole
I want to reduce the memory usage as much as possible. AFAIK, many column-store databases reach high compression ratios. Is there any best practice for memory tuning?
About this issue
- State: open
- Created 3 years ago
- Comments: 24
The branch is: https://github.com/hyrise/hyrise/tree/martin/sf1000
You can execute TPC-H SF 1000 the following way:
./cmake-build-release/hyriseBenchmarkTPCH -s 1000 --scheduler --data_preparation_cores 8 --encoding simple__LPCompressionSelection_3148260871.json
The configuration is required to reduce the data set size; it is committed to the branch. The data preparation cores limit the concurrency of encoding. Otherwise, on a server with many cores, all available cores would encode the 1000 GB of data concurrently, which is too much for a 512 GB system.
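For anyone wanting to write such an encoding file themselves instead of patching the source: as far as I can tell, the `--encoding` option also accepts a JSON file with per-type and per-column overrides. A hypothetical sketch below; the key names (`default`, `type`, `custom`) and encoding names are from memory and should be verified against the EncodingConfig parser in `src/benchmarklib`:

```json
{
  "default": {"encoding": "Dictionary"},
  "type": {"string": {"encoding": "LZ4"}},
  "custom": {
    "lineitem": {
      "l_comment": {"encoding": "LZ4"}
    }
  }
}
```

This keeps low-cardinality columns dictionary-encoded (where they compress well) while applying LZ4 only where it pays off, rather than forcing LZ4 everywhere as the patch above does.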