ClickHouse: Possible memory leak

ClickHouse version: 21.6.6.51 (we cannot upgrade until the fix for https://github.com/ClickHouse/ClickHouse/issues/32744 is released).

Memory usage grows constantly, and once the RSS limit is reached, ClickHouse starts failing random queries (inserts into and selects from different tables). We cannot find where the memory goes.
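As a first check (my own suggestion, not part of the original report), the cumulative counters in system.errors can tell whether the failing queries die with MEMORY_LIMIT_EXCEEDED (code 241) or with something else, such as the code 24 errors that appear in the logs below:

-- Cumulative error counters since server start (system.errors exists in 21.x;
-- treat this as a sketch and verify the column set on your version).
SELECT name, code, value
FROM system.errors
WHERE name IN ('MEMORY_LIMIT_EXCEEDED', 'CANNOT_WRITE_TO_OSTREAM')
ORDER BY value DESC;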

Additional information:

  • All queries are run by users with max_memory_usage=750MiB.
  • We insert rows in batches every 100 ms into 32 Buffer tables, ~50k rows/s in total (a minimal sketch of such a table follows this list).
  • We use live views and don’t use dictionaries.
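For context, here is a minimal sketch of what one of these Buffer tables could look like (the database, table, and threshold values are hypothetical; the real definitions are not included in this report):

-- Buffers inserts in memory and flushes them to algo.events_dest once any
-- max_* threshold is hit; all values below are illustrative only.
CREATE TABLE algo.events_buffer AS algo.events_dest
ENGINE = Buffer(algo, events_dest,
    16,                 -- num_layers
    1, 10,              -- min_time, max_time in seconds
    10000, 1000000,     -- min_rows, max_rows
    1000000, 100000000  -- min_bytes, max_bytes
);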

Metrics

Metrics over time (sharp drops are restarts). Here we set max_server_memory_usage_to_ram_ratio to 0.3 (37 GiB) to make the peaks visible; the behavior is the same with 0.9 (114 GiB). [screenshot: 2021-12-30_08-38-15]

The purged metrics show suspiciously high values (see the query below).
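Assuming "purged" refers to the jemalloc counters exposed via system.asynchronous_metrics, they can be listed directly (metric names may differ slightly between versions):

-- jemalloc dirty/muzzy purge counters: high values mean the allocator
-- returns pages to the OS often, which is not by itself a leak.
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric ILIKE '%purged%'
ORDER BY metric;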

We don't see any correlation with other events. [screenshot: 2021-12-30_09-26-08]

Attempts to find the reason

SELECT formatReadableSize(sum(total_bytes)) FROM system.tables WHERE engine IN ('Memory','Set','Join');
172.57 KiB

SELECT formatReadableSize(sum(memory_usage)) FROM system.merges;
249.99 MiB

SELECT formatReadableSize(sum(memory_usage)) FROM system.processes;
36.97 MiB

SELECT formatReadableSize(sum(memory_usage)), formatReadableSize(sum(peak_memory_usage)) FROM system.processes;
┌─formatReadableSize(sum(memory_usage))─┬─formatReadableSize(sum(peak_memory_usage))─┐
│ 0.00 B                                │ 0.00 B                                     │
└───────────────────────────────────────┴────────────────────────────────────────────┘

SELECT
    metric,
    formatReadableSize(value)
FROM system.asynchronous_metrics
WHERE metric IN ('UncompressedCacheBytes', 'MarkCacheBytes')
┌─metric─────────────────┬─formatReadableSize(value)─┐
│ UncompressedCacheBytes │ 0.00 B                    │
│ MarkCacheBytes         │ 55.91 MiB                 │
└────────────────────────┴───────────────────────────┘

SELECT
    formatReadableSize(sum(primary_key_bytes_in_memory)) AS primary_key_bytes_in_memory,
    formatReadableSize(sum(primary_key_bytes_in_memory_allocated)) AS primary_key_bytes_in_memory_allocated
FROM system.parts
┌─primary_key_bytes_in_memory─┬─primary_key_bytes_in_memory_allocated─┐
│ 337.63 MiB                  │ 624.06 MiB                            │
└─────────────────────────────┴───────────────────────────────────────┘
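The known consumers above add up to roughly 1 GiB, nowhere near the observed RSS. One more comparison that could help (my suggestion; verify the metric names on 21.6) is ClickHouse's own tracked total versus what the OS reports:

-- Memory accounted for by ClickHouse's internal tracker...
SELECT formatReadableSize(value)
FROM system.metrics
WHERE metric = 'MemoryTracking';

-- ...versus the process RSS seen by the OS; a large gap points at
-- untracked allocations or allocator fragmentation rather than queries.
SELECT metric, formatReadableSize(value)
FROM system.asynchronous_metrics
WHERE metric IN ('MemoryResident', 'MemoryVirtual');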

Logs

Regularly (1-2 times per hour) we see the following pattern in the logs:

2021.12.28 10:22:33.397711 [ 1854971 ] {1822bef8-7c15-42d4-9113-a1bce442a4c0} <Error> executeQuery: Code: 24, e.displayText() = DB::Exception: Cannot write to ostream at offset 59798 (version 21.6.6.51 (official build)) (from 10.221.8.220:46046) (in query: WATCH lv_25bed493f7b3a63aade737679c0974f3f320bd99 FORMAT JSONEachRowWithProgress ), Stack trace (when copying this message, always include the lines below):
2021.12.28 10:22:33.400234 [ 1854971 ] {1822bef8-7c15-42d4-9113-a1bce442a4c0} <Error> DynamicQueryHandler: Code: 24, e.displayText() = DB::Exception: Cannot write to ostream at offset 59798, Stack trace (when copying this message, always include the lines below):
2021.12.28 10:22:33.400944 [ 1854971 ] {1822bef8-7c15-42d4-9113-a1bce442a4c0} <Error> DynamicQueryHandler: Cannot send exception to client: Code: 24, e.displayText() = DB::Exception: Cannot write to ostream at offset 59871, Stack trace (when copying this message, always include the lines below):
2021.12.28 10:22:33.402017 [ 1854971 ] {} <Error> ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 32, e.displayText() = I/O error: Broken pipe, Stack trace (when copying this message, always include the lines below):

This may be related to the client library (perhaps it doesn't read the whole response).
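To see whether these failures correlate with the memory growth, the hourly rate of code 24 can be pulled from system.query_log (assuming query logging is enabled; the exception_code column should exist on this version, but verify):

-- Hourly count of "Cannot write to ostream" failures over the last day.
SELECT toStartOfHour(event_time) AS hour, count() AS errors
FROM system.query_log
WHERE event_date >= today() - 1
  AND type = 'ExceptionWhileProcessing'
  AND exception_code = 24
GROUP BY hour
ORDER BY hour;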

The live view definition is simple enough:

CREATE LIVE VIEW algo.lv_17cc0dbf4232839381bb35d6b65ee11f8b51b1bb WITH TIMEOUT 5
(`now64(9)` DateTime64(9)) AS
SELECT now64(9) FROM ... WHERE some_id IN
(
    SELECT some_id
    FROM subscribers
)
LIMIT 1
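Since the WATCH queries above keep failing, one thing worth checking (again my suggestion) is whether live views accumulate instead of being dropped once WITH TIMEOUT 5 expires after the last watcher disconnects:

-- Live views appear in system.tables with engine = 'LiveView';
-- a steadily growing count would point at leaked views.
SELECT count()
FROM system.tables
WHERE engine = 'LiveView';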

What else can I check?

I haven’t found related issues, except for https://github.com/ClickHouse/ClickHouse/issues/32750, but we don’t use cancel_http_readonly_queries_on_client_close.
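One further avenue is the built-in memory profiler, a sketch of which follows (it samples allocations into system.trace_log; the settings and introspection functions below must be enabled, so treat this as an assumption to verify on 21.6):

-- Requires total_memory_profiler_step in the server config, e.g.
--   <total_memory_profiler_step>4194304</total_memory_profiler_step>
-- and allow_introspection_functions = 1 in the session running this query.
SELECT
    count() AS samples,
    formatReadableSize(sum(size)) AS sampled_bytes,
    arrayStringConcat(arrayMap(x -> demangle(addressToSymbol(x)), trace), '\n') AS stack
FROM system.trace_log
WHERE trace_type = 'Memory' AND event_date = today()
GROUP BY trace
ORDER BY sum(size) DESC
LIMIT 10;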

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 4
  • Comments: 15 (12 by maintainers)

Most upvoted comments

That's another problem (I've faced it in https://github.com/ClickHouse/ClickHouse/issues/22996); it is likely not related here, but it is also inconvenient.