risingwave: perf: kafka source performance regression due to tracing
To verify whether the environment had an impact, we re-ran the nightly-20231001 image as the last data point; its performance is good.
The first drop happens between nightly-20231008 and nightly-20231009; the second drop happens between nightly-20231012 and nightly-20231015.
Commits between nightly-20231008 and nightly-20231009:
## nightly-20231009
- 5ae9a360daeeefba400e44d066c76e2bc2526ff0 chore: Update Slack link to a permanent one in README (#12700) https://github.com/risingwavelabs/risingwave/pull/12700
- 2fef0544a7ec4c4fe9921e35f421793c6e68b29a fix: add "ts_ms" in source (#12703) https://github.com/risingwavelabs/risingwave/pull/12703
- 0c4013cc3a95caa465c51028c5d80c1f33ca413b chore(deps): bump etcd-client to v0.12.1 (#12705) https://github.com/risingwavelabs/risingwave/pull/12705
- 9db52aab732b4a851ec397e91c2182b7f5e58056 ci: fix disk out of space for GHA (attempt 2) (#12671) https://github.com/risingwavelabs/risingwave/pull/12671
- bfdafe1b65741b11c8d998b052c1ae1bbff697dd refactor(common): refine `Row` changes in #11070 (#12652) https://github.com/risingwavelabs/risingwave/pull/12652
- da5099bbcc1a26670d7e040dd1b65070f03e1652 perf(stream agg): flush dirty agg groups according to heap size (#12453) https://github.com/risingwavelabs/risingwave/pull/12453
- 2a7ef0d808f17a0e82ee72e6376e7f77c2ba07aa fix(log-store): handle unaligned epoch after recovery (#12407) https://github.com/risingwavelabs/risingwave/pull/12407
- 456f3e98d7a00e3c383f8b03ee40b5da09af58d9 chore(storage): upgrade write limit (#12658) https://github.com/risingwavelabs/risingwave/pull/12658
- 54e8c759a295ff9ac6888b34520bdb1c48a6c363 chore: Update Slack link in README.md (#12697) https://github.com/risingwavelabs/risingwave/pull/12697
- b4f357c22d825a6caad31011e718ab88c41f1bc2 feat(connector): structural logging for parsing error (#12659) https://github.com/risingwavelabs/risingwave/pull/12659
- fbd54d89b05a27f9f8da803e0e29d3b0c4b2e366 fix(storage): fix block format comment (#11664) https://github.com/risingwavelabs/risingwave/pull/11664
- da1e7bed1a6eaf2180ec4f5c57d189227ff08a99 chore: update cherry pick action version (#12692) https://github.com/risingwavelabs/risingwave/pull/12692
- 31cf3f8f1b827fad5572a6a188507a01f5b396ba fix(storage): fix compact_fast_runner_bytes unused (#12655) https://github.com/risingwavelabs/risingwave/pull/12655
- fcdc3a112bb64ac151d44cc48f0bb143d92d456d perf(expr): avoid parsing unit every time on `extract` and `date_part` (#12398) https://github.com/risingwavelabs/risingwave/pull/12398
- ca2509546d056c863752bd4b7568cf0aa5559fe6 chore(deps): Bump byteorder from 1.4.3 to 1.5.0 (#12666) https://github.com/risingwavelabs/risingwave/pull/12666
- a8b25ac3133c1f65bffa467bcfc884b949cca23d fix(doc): fix debian-like installation guide (#12684) https://github.com/risingwavelabs/risingwave/pull/12684
- 8b082dea901084c5f9a6462789e18f77dbe3efc0 chore(deps): Bump similar from 2.2.1 to 2.3.0 (#12668) https://github.com/risingwavelabs/risingwave/pull/12668
## nightly-20231008
We notice that q0 is a stateless query that does no computation and no state access, so the regression is likely related to parsing, I guess?
First regression is #12659 cc @BugenZhao
The two points are before/after #12659.
CPU usage:
Just confirmed that the second drop is caused by the rdkafka bump.
might take a look at https://github.com/confluentinc/librdkafka/blob/master/CHANGELOG.md and https://github.com/fede1024/rust-rdkafka/blob/master/changelog.md
387d251b722a65dd7147b04aff611a2381c44bf4’s performance is better
Why does upgrading rdkafka lower CPU usage but also lower throughput?
The flamegraph can be downloaded or directly examined at the Buildkite URL, under the artifacts tab. In summary, the RisingWave runtime CPU percentage decreases (68.58% -> 62.54%) and the rdkafka runtime CPU percentage increases (14.03% -> 17.04%) 🤔
I cannot observe anything useful from the flame graphs provided above; perhaps the files for 10/08 and 10/09 are incorrect?
However, in #13073 I can clearly see that the `tracing::info_span!` introduced in #12659 leads to an overhead of 30% (😨). https://github.com/risingwavelabs/risingwave/blob/1ca0a8030cf631886e599894d688d1768501b2a2/src/connector/src/parser/mod.rs#L540-L544
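For reference, the pattern is roughly the following. This is a minimal illustrative sketch rather than the actual RisingWave code (the field names and helper functions are made up): a span is created for every single message, only so that a potential parsing error carries some context.

```rust
use tracing::info_span;

/// Illustrative per-message parse loop (not the real parser):
/// an `info_span!` is created for every message, purely so that a
/// parsing error, if any, has some context attached.
fn parse_batch(messages: &[Vec<u8>]) {
    for (offset, payload) in messages.iter().enumerate() {
        // With a `fmt` subscriber installed, these fields are formatted
        // eagerly when the span is created, even though the span only
        // matters on the (rare) error path.
        let span = info_span!("parse_message", offset = offset as u64);
        let _enter = span.enter();

        if let Err(e) = parse_one(payload) {
            // The enclosing span supplies the context fields.
            tracing::error!(error = %e, "failed to parse message");
        }
    }
}

// Hypothetical parser stub, just to make the sketch compile.
fn parse_one(_payload: &[u8]) -> Result<(), std::io::Error> {
    Ok(())
}
```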
Why it takes so much CPU time: the `fmt` layer in `tracing-subscriber` eagerly formats the given fields on span creation, possibly because it only receives references to the fields. However, the span is only used to provide more context when a parsing error occurs, which means that most of the time this work is wasted. https://docs.rs/tracing-subscriber/0.3.17/src/tracing_subscriber/fmt/fmt_layer.rs.html#797-812
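A rough way to see this cost (a standalone micro-benchmark sketch assuming a default `fmt` subscriber; the numbers and field names are illustrative, not from this issue): the field formatting happens at `info_span!` creation time, whether or not any event is ever emitted inside the span.

```rust
use std::time::Instant;

fn main() {
    // Default `fmt` layer, as used in a typical setup.
    tracing_subscriber::fmt().init();

    let key = "some-kafka-message-key-1234567890";

    let start = Instant::now();
    for offset in 0u64..1_000_000 {
        // The `fmt` layer formats `key` and `offset` in `on_new_span`,
        // even though no event is emitted and nothing is ever printed.
        let span = tracing::info_span!("parse_message", key = %key, offset);
        let _enter = span.enter();
    }
    eprintln!("with per-message span: {:?}", start.elapsed());

    let start = Instant::now();
    for offset in 0u64..1_000_000 {
        // Baseline: same loop without the span.
        std::hint::black_box((key, offset));
    }
    eprintln!("without span:          {:?}", start.elapsed());
}
```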
I’m not sure whether we need to investigate rdkafka further. 🤡
Let’s try to fix tracing first and see.
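One possible shape of such a fix (just a sketch of the general idea, not necessarily what the actual PR does): skip the per-message span on the happy path and attach the context fields directly to the error event, so the formatting cost is only paid when a parse actually fails.

```rust
use tracing::error;

/// Illustrative alternative to the per-message span (not the actual fix):
/// context is only materialized on the error path.
fn parse_batch(messages: &[Vec<u8>], split_id: &str) {
    for (offset, payload) in messages.iter().enumerate() {
        if let Err(e) = parse_one(payload) {
            // Fields are formatted only when a parsing error occurs.
            error!(
                error = %e,
                split_id,
                offset = offset as u64,
                "failed to parse message"
            );
        }
    }
}

// Hypothetical parser stub, just to make the sketch compile.
fn parse_one(_payload: &[u8]) -> Result<(), std::io::Error> {
    Ok(())
}
```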
Just noticed that the CPU flamegraphs are actually swapped.
1008 (actually 1009)'s tracing takes more CPU. And it actually has more samples (rdkafka’s samples unchanged, rw’s samples increased).
1009 (actually 1008)
But why does the flamegraph show that `risingwave` takes less CPU in 1009? 😕