liftbridge: replicator seems to cause high CPU when no messages are published to liftbridge

I created a stream testtlblog with 2 partitions and a replication factor of 2. After it ran for a while, I stopped the producer but left the Liftbridge servers and the NATS server running. Some time later, I noticed high CPU usage with the top command (screenshot attached).
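
For context, here is a minimal sketch of how a stream like this can be created with the Go client. The client address is an assumption; the issue does not show the actual tooling used.

package main

import (
	"context"
	"log"

	lift "github.com/liftbridge-io/go-liftbridge/v2"
)

func main() {
	// Address is an assumption; point it at one of your Liftbridge servers.
	client, err := lift.Connect([]string{"localhost:9292"})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Create the stream with 2 partitions and a replication factor of 2,
	// matching the setup described above.
	err = client.CreateStream(context.Background(), "testtlblog", "testtlblog",
		lift.Partitions(2), lift.ReplicationFactor(2))
	if err != nil && err != lift.ErrStreamExists {
		log.Fatal(err)
	}
}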

I also found that massive NATS traffic was flowing between the Liftbridge servers; it looks like they are doing some replication work.

[#25689] Received on [testlb.testtlblog.0.replicate]: '�C� server-2���'
[#25690] Received on [_INBOX.FRbHzl6QRyqq2LNMOKbW1m.uHyUJJr7]: '�C!]��'
[#25691] Received on [testlb.notify.server-2]: '�C�
testtlblog'
[#25692] Received on [testlb.testtlblog.0.replicate]: '�C� server-2���'
[#25693] Received on [_INBOX.FRbHzl6QRyqq2LNMOKbW1m.vGjesjOE]: '�C!]��'
[#25694] Received on [testlb.notify.server-2]: '�C�
testtlblog'
[#25695] Received on [testlb.testtlblog.0.replicate]: '�C� server-2���'
[#25696] Received on [_INBOX.FRbHzl6QRyqq2LNMOKbW1m.crJLwdFT]: '�C!]��'
[#25697] Received on [testlb.notify.server-2]: '�C�
testtlblog'

I got the above information with the nats-sub tool. testlb is the clustering namespace, and testtlblog is the stream name.
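
For anyone who wants to watch the same traffic programmatically, here is a small sketch using the NATS Go client, equivalent to running nats-sub on a wildcard subject. The server address is an assumption.

package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to one of the NATS servers backing the cluster (address assumed).
	nc, err := nats.Connect("nats://10.10.10.19:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// testlb is the clustering namespace, so testlb.> matches the internal
	// Liftbridge subjects seen above, e.g. testlb.<stream>.<partition>.replicate
	// and testlb.notify.<server-id>.
	_, err = nc.Subscribe("testlb.>", func(m *nats.Msg) {
		fmt.Printf("Received on [%s]: %d bytes\n", m.Subject, len(m.Data))
	})
	if err != nil {
		log.Fatal(err)
	}

	select {} // block forever
}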

The messages in this log have already been cleaned up because the retention max age was reached, so how can this still be happening? I need your help and would appreciate it very much.

root /opt/liftbridge# ls -l data/server-2/streams/testtlblog/0/
total 8
-rw-r--r-- 1 root root 10485760 Jan 6 18:36 00000000000559806923.index
-rw-r--r-- 1 root root 0 Jan 6 18:36 00000000000559806923.log
-rw------- 1 root root 17 Jan 6 18:44 leader-epoch-checkpoint
-rw------- 1 root root 9 Jan 7 14:43 replication-offset-checkpoint

By the way, my Liftbridge server is v1.5.0 and my Go version is 1.15.5.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 21 (9 by maintainers)

Most upvoted comments

Reproducing takes more time (about 8 hours); see the attached screenshot. Steps (a rough Go sketch of them follows the list):

  1. Clean up the data folder
  2. Create 2 streams with 2 partitions each
  3. Subscribe to all 4 partitions
  4. Produce messages to one of the streams (I send 1 message per second)
  5. Wait about 2-8 hours
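
A minimal sketch of these steps with the Go client; the address, the stream names, and error handling are assumptions based on the description and logs above.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	lift "github.com/liftbridge-io/go-liftbridge/v2"
)

func main() {
	// Address is an assumption; point it at one of the Liftbridge servers.
	client, err := lift.Connect([]string{"localhost:9292"})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	ctx := context.Background()

	streams := []string{"test.stream1", "test.stream2"}

	// Step 2: create 2 streams with 2 partitions each, replicated.
	for _, name := range streams {
		err := client.CreateStream(ctx, name, name,
			lift.Partitions(2), lift.ReplicationFactor(2))
		if err != nil && err != lift.ErrStreamExists {
			log.Fatal(err)
		}
	}

	// Step 3: subscribe to all 4 partitions.
	for _, name := range streams {
		for p := int32(0); p < 2; p++ {
			stream := name
			err := client.Subscribe(ctx, stream, func(msg *lift.Message, err error) {
				if err == nil {
					fmt.Printf("%s: offset %d\n", stream, msg.Offset())
				}
			}, lift.Partition(p))
			if err != nil {
				log.Fatal(err)
			}
		}
	}

	// Step 4: produce 1 message per second to one of the streams, then
	// leave everything running and wait (step 5).
	for {
		if _, err := client.Publish(ctx, streams[0], []byte("tick")); err != nil {
			log.Println(err)
		}
		time.Sleep(time.Second)
	}
}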


liftbridge.yml

clustering:
    min.insync.replicas: 2
    raft.bootstrap.peers:
    - mstage-db
    - mstage-wrk
    - mstage-wrk2
    server.id: mstage-db
cursors:
    stream.auto.pause.time: 0
    stream.partitions: 1
data:
    dir: /var/lib/liftbridge
host: 10.10.10.19
logging:
    level: debug
    raft: true
nats.servers:
- nats://10.10.10.19:4222
- nats://10.10.10.20:4222
- nats://10.10.10.21:4222
streams:
    compact.enabled: true
    retention.max:
        age: 1h

Going to create a patch release that will include the fix shortly.

So far I’ve been unable to reproduce unfortunately. Will continue trying…

I got the following:

High CPU usage happens when LB spams messages to NATS (screenshot attached).

The code that generates the messages is the replicator's caughtUp check. When the segment is created we have:

Feb 24 20:05:47 mstage-wrk liftbridge[19222]: time="2021-02-24 20:05:47" level=debug msg="Check if we're caught up%!(EXTRA int64=-1, int64=-1)"

After creating the stream and publishing 1 message (Position: 144, LastOffset: 0):

Feb 24 20:05:49 mstage-wrk liftbridge[19222]: time="2021-02-24 20:05:49" level=debug msg="Check if we're caught up%!(EXTRA int64=0, int64=0)"
Feb 24 20:05:49 mstage-wrk liftbridge[19222]: time="2021-02-24 20:05:49" level=debug msg="[NotifyLEO] param leo: 0"
Feb 24 20:05:49 mstage-wrk liftbridge[19222]: time="2021-02-24 20:05:49" level=debug msg="[NotifyLEO] Segment params. Position: 144, LastOffset: 0, MaxBytes: 268435456"

After segment.MaxAge elapses, Liftbridge creates a new segment with LastOffset: -1:

Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg=checkAndPerformSplit
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="Appending new log segment for /var/lib/liftbridge/streams/test.stream2/0 with base offset 1"
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="Seal Active Segment"
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="[commitlog.Clean] Update the leader epoch offset cache to account for deleted segments"
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="[commitlog.Clean] leaderEpochCache.ClearEarliest with segment BaseOffset: %!(EXTRA int64=0)"
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="ClearEarliest Start%!(EXTRA int64=0, int64=0)"
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="End Clean%!(EXTRA <nil>)"
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="Check if we're caught up%!(EXTRA int64=0, int64=0)"
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="[NotifyLEO] param leo: 0"
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="[NotifyLEO] Segment params. Position: 0, LastOffset: -1, MaxBytes: 268435456"
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="Replicator Start GoRoutine%!(EXTRA <-chan struct {}=0xc0005c2fc0)"
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="Replicator End GoRoutine%!(EXTRA <-chan struct {}=0xc0005c2fc0)"

It then calls r.partition.log.NotifyLEO(r, leo) with leo: 0, but the new segment was created with Position: 0, LastOffset: -1. This call reaches WaitForLEO, which returns from this block:

	if s.lastOffset != leo {
		ch := make(chan struct{})
		close(ch)
		return ch
	}

This repeats until a new message arrives. Then the segment's LastOffset is set to 0, the condition stops being met, and the function returns s.waitForData(waiter, s.position) instead.
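
To illustrate why this burns CPU, here is a simplified, self-contained sketch (not the actual Liftbridge code): the wait call hands back an already-closed channel whenever the empty segment's lastOffset (-1) differs from the leo passed in (0), so the receive returns immediately and the replicator loop never blocks.

package main

import "fmt"

func waitForLEO(lastOffset, leo int64) <-chan struct{} {
	if lastOffset != leo {
		ch := make(chan struct{})
		close(ch) // already closed: the caller will not block on it
		return ch
	}
	// Otherwise a channel that is closed only when new data arrives
	// would be returned (omitted in this sketch).
	return make(chan struct{})
}

func main() {
	lastOffset, leo := int64(-1), int64(0) // state right after the segment roll
	spins := 0
	for i := 0; i < 1000000; i++ { // in the server this loop has no bound
		<-waitForLEO(lastOffset, leo) // returns immediately, never blocks
		spins++
	}
	fmt.Println("looped", spins, "times without ever blocking")
}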

This situation occurs when retention is based only on time (with byte-based retention the probability is low).

I added some debug logging to the code and set segment.MaxAge to 5m to speed up reproduction. Config:

streams:
    compact.enabled: false
    retention.max:
        age: 24h

Sorry for the chaotic description. I can explain in more detail and attach logs if needed.
