risingwave: bug: `barrier_inflight_latency` is too high and keeps growing
I’m running NexMark with a self-made query, which consists of one Join and one Agg. What could be the reason that the barrier latency keeps growing? It has already exceeded 8 minutes.

The query is:
create materialized view mv_q6s as
SELECT
MAX(B.price) AS final,
A.id,
A.seller
FROM
auction AS A,
bid AS B
WHERE
A.id = B.auction
and B.date_time between A.date_time and A.expires
GROUP BY
A.id,
A.seller
Is it stuck? – No
/risedev ctl meta pause
After the pause, the number of in-flight barriers starts to drop, so it’s not stuck. The pause succeeded after 324.06s 😂 and the in-flight barriers were cleared:
[cargo-make] INFO - Build Done in 324.06 seconds.

So, what could be the reason?
While the barrier in-flight latency is high, the source throughput is not extremely low in this issue. I suspect this is somehow related to the messages buffered when backpressure happens. With some experiments, I can conclude that this is indeed the cause, and the behaviour change comes from the permit-based backpressure introduced by #6170.

I deployed one CN using `risedev dev full`, so we use the local channel for exchange. I compared the following setups:
- 915ac8
- 2ac0c8d with `MAX_CHUNK_PERMITS = 28682` (default)
- 2ac0c8d with `MAX_CHUNK_PERMITS = 2048`

In summary, the default setting in #6170 makes the number of buffered messages larger than before. This leads to more messages being processed between barriers when backpressure happens, so the barrier in-flight latency increases. It does not affect source throughput, because the buffer size does not affect the actor throughput.
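To make the mechanism concrete, here is a minimal sketch of permit-based backpressure built on `tokio::sync::Semaphore`; it is an illustration, not RisingWave's actual exchange code. The sender must acquire as many permits as the chunk has rows before it can enqueue, so the permit budget (the analogue of `MAX_CHUNK_PERMITS`) bounds how many rows can pile up in the channel ahead of a barrier.

```rust
use std::sync::Arc;

use tokio::sync::{mpsc, OwnedSemaphorePermit, Semaphore};

/// A toy chunk: just a batch of rows. Real data chunks carry columns, etc.
struct Chunk {
    rows: u32,
}

/// Hypothetical permit-based sender: the total number of buffered rows is
/// bounded by the semaphore budget (the analogue of MAX_CHUNK_PERMITS).
struct PermitSender {
    permits: Arc<Semaphore>,
    tx: mpsc::UnboundedSender<(Chunk, OwnedSemaphorePermit)>,
}

impl PermitSender {
    /// Waits until enough row permits are available before enqueueing,
    /// which is where backpressure kicks in.
    async fn send(&self, chunk: Chunk) {
        let permit = Arc::clone(&self.permits)
            .acquire_many_owned(chunk.rows)
            .await
            .expect("semaphore closed");
        // The permit travels with the chunk and is released only when the
        // receiver drops it, i.e. after the chunk has been consumed.
        self.tx.send((chunk, permit)).unwrap();
    }
}

#[tokio::main]
async fn main() {
    // With a budget of 2048 row permits, at most 2048 rows can sit in the
    // channel; with 28682 permits, far more rows pile up ahead of a barrier.
    let permits = Arc::new(Semaphore::new(2048));
    let (tx, mut rx) = mpsc::unbounded_channel();
    let sender = PermitSender { permits, tx };

    tokio::spawn(async move {
        for _ in 0..10 {
            sender.send(Chunk { rows: 1024 }).await;
        }
    });

    while let Some((chunk, _permit)) = rx.recv().await {
        // Dropping `_permit` at the end of the iteration returns the rows'
        // permits to the budget, letting the sender make progress.
        println!("consumed a chunk of {} rows", chunk.rows);
    }
}
```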
IMO, we can decrease the default buffer size (i.e. lower `MAX_CHUNK_PERMITS`) as a quick fix. In the future we can use some heuristics to tune it dynamically based on barrier in-flight latency and actor throughput.

@wangrunji0408 points out that the data generator will sleep to meet the 46:3:1 ratio only if `min.event.gap.in.ns` is set to a non-zero value. Since sleeping for a very short time is not accurate in tokio, I set the value to 1000 (so it sleeps at least 1000 ns × 1024 ≈ 1 ms per chunk) and tested it… and it seemed to work well? The throughput of actors 1-4 in my case is ~70000 rows per second, which is much larger than the result from @hzxa21. 🤔 This is interesting. Did we run the earlier tests under the debug profile?
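As a small illustration of the arithmetic above (assuming 1024 events per chunk and treating `min.event.gap.in.ns` as a plain per-event delay), sleeping once per chunk keeps the requested delay above tokio's ~1 ms timer granularity:

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    // Numbers from the discussion above: a 1000 ns minimum event gap and
    // (assumed) 1024 events per chunk.
    let min_event_gap_ns: u64 = 1_000;
    let events_per_chunk: u64 = 1_024;

    // Sleeping 1000 ns per event is far below tokio's ~1 ms timer
    // granularity, so sleep once per chunk instead:
    // 1000 ns * 1024 = 1_024_000 ns ≈ 1.02 ms.
    let per_chunk_sleep = Duration::from_nanos(min_event_gap_ns * events_per_chunk);
    println!("sleep per chunk: {:?}", per_chunk_sleep);

    for _ in 0..3 {
        // ... generate and emit one chunk of events here ...
        tokio::time::sleep(per_chunk_sleep).await;
    }
}
```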
I can consistently reproduce this issue with `risedev d full`. I observed abnormal streaming actor metrics: `Actor Execution Time` for the last fragment (join -> agg -> mv) is consistently at ~1s. This means the actor is stuck or almost stuck, given that this metric measures the rate of the actor's total tokio poll time (i.e. the max value of this metric should be 1s).

I originally suspected this was caused by high storage read latency, but the cache hit rate is high and the read tail latency shown in the metrics is not significant. I will continue investigating the issue.
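To illustrate why ~1 s is the ceiling for this kind of metric, here is a hedged sketch (not the actual RisingWave metric code): wrap an actor future so that every `poll` adds its wall-clock duration to a counter, then sample the counter once per second. A saturated actor accumulates close to one second of poll time per wall-clock second, so the rate caps out near 1 s/s.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll};
use std::time::{Duration, Instant};

/// Wraps an actor future and adds the wall-clock duration of every `poll`
/// call to a shared nanosecond counter. The per-second rate of that counter
/// is the "how busy is this actor" signal and can never exceed ~1 s/s.
struct InstrumentedActor {
    inner: Pin<Box<dyn Future<Output = ()> + Send>>,
    poll_ns: Arc<AtomicU64>,
}

impl Future for InstrumentedActor {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let start = Instant::now();
        let result = self.inner.as_mut().poll(cx);
        self.poll_ns
            .fetch_add(start.elapsed().as_nanos() as u64, Ordering::Relaxed);
        result
    }
}

#[tokio::main]
async fn main() {
    let poll_ns = Arc::new(AtomicU64::new(0));

    // A toy "actor" that is CPU-bound: it does some work, yields, and is
    // immediately rescheduled, so it is being polled nearly all the time.
    let actor = InstrumentedActor {
        inner: Box::pin(async {
            loop {
                std::hint::black_box((0..1_000_000u64).sum::<u64>());
                tokio::task::yield_now().await;
            }
        }),
        poll_ns: Arc::clone(&poll_ns),
    };
    tokio::spawn(actor);

    // Sample the counter once per second: the per-second delta approaches
    // 1 s when the actor is saturated, matching a ~1 s Actor Execution Time.
    let mut last = 0;
    for _ in 0..3 {
        tokio::time::sleep(Duration::from_secs(1)).await;
        let now = poll_ns.load(Ordering::Relaxed);
        println!(
            "poll time in the last second: {:?}",
            Duration::from_nanos(now - last)
        );
        last = now;
    }
}
```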
I’m designing a fine-grained backpressure mechanism to control the join actors’ event consumption from two sides. Will first write a design document for this.
The original test done by me was run against a release build.
I guess the main cause is the “debug build”. 🤔 Maybe we should set smaller initial permits in debug builds.
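If we go that way, a minimal sketch could key the default off the build profile via `cfg!(debug_assertions)`. The constant names are made up, and the values are just the numbers mentioned in this thread, not the actual configuration:

```rust
/// Hypothetical default permit budgets; the real values and the way
/// RisingWave configures MAX_CHUNK_PERMITS may differ.
const RELEASE_CHUNK_PERMITS: usize = 28682;
const DEBUG_CHUNK_PERMITS: usize = 2048;

/// Pick a smaller exchange buffer in debug builds, where actors are slow
/// enough that a large buffer lets many chunks pile up between barriers.
fn default_chunk_permits() -> usize {
    if cfg!(debug_assertions) {
        DEBUG_CHUNK_PERMITS
    } else {
        RELEASE_CHUNK_PERMITS
    }
}

fn main() {
    println!("default chunk permits: {}", default_chunk_permits());
}
```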
Some ideas:
- … `bid` in the same time. This makes the workload not realistic. Will this lead to problems?

Great work!!! In short, the messages were piled up in the exchange channels.
I agree that we may need to decrease `MAX_CHUNK_PERMITS`. 2048 is too small; perhaps we can start from half of the current `MAX_CHUNK_PERMITS`?

Some other ideas:
- Use `estimate_size()` of `DataChunk`; it looks like a better kind of permit than the number of rows (a rough sketch follows below).
- The currently used permits divided by the total permits measures the degree of backpressure, which sounds like a pretty useful metric. The reporting can be done on every barrier.

Actor traces:
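Following up on the two ideas in the list above (size-based permits and a backpressure-degree metric), here is a rough, hypothetical sketch. The names are made up, and `estimate_size()` is only modelled by a local trait here, not the real `DataChunk` API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Stand-in for a chunk's estimated in-memory size in bytes; the real
/// DataChunk::estimate_size() is only assumed here, not called.
trait EstimateSize {
    fn estimate_size(&self) -> usize;
}

/// Hypothetical byte-based permit pool for an exchange channel.
struct BytePermits {
    total_bytes: usize,
    used_bytes: AtomicUsize,
}

impl BytePermits {
    fn new(total_bytes: usize) -> Self {
        Self { total_bytes, used_bytes: AtomicUsize::new(0) }
    }

    /// Charge a chunk by its estimated byte size rather than its row count
    /// and report whether the pool is still within budget (a real
    /// implementation would block the sender instead of returning a bool).
    fn charge<T: EstimateSize>(&self, chunk: &T) -> bool {
        let size = chunk.estimate_size();
        let used_before = self.used_bytes.fetch_add(size, Ordering::Relaxed);
        used_before + size <= self.total_bytes
    }

    fn release(&self, size: usize) {
        self.used_bytes.fetch_sub(size, Ordering::Relaxed);
    }

    /// Used permits divided by total permits, in [0, 1]: a value near 1
    /// means the channel is full and the upstream actor is being blocked.
    /// This could be sampled and reported once per barrier.
    fn backpressure_degree(&self) -> f64 {
        self.used_bytes.load(Ordering::Relaxed) as f64 / self.total_bytes as f64
    }
}

struct FakeChunk {
    bytes: usize,
}

impl EstimateSize for FakeChunk {
    fn estimate_size(&self) -> usize {
        self.bytes
    }
}

fn main() {
    let permits = BytePermits::new(4 * 1024 * 1024); // 4 MiB budget
    let chunk = FakeChunk { bytes: 256 * 1024 };
    let within_budget = permits.charge(&chunk);
    println!(
        "within budget: {within_budget}, backpressure degree: {:.2}",
        permits.backpressure_degree()
    );
    permits.release(chunk.bytes);
}
```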