liftbridge: replicator seems to cause high CPU when no messages are published to liftbridge

I created a stream testtlblog with 2 partitions and a replication factor of 2. After it ran for a while, I stopped the producer but left the Liftbridge servers and the NATS server running. Some time later, I noticed high CPU usage with the top command (screenshot attached).
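
For context, here is a minimal sketch of how a stream like this can be created with the Go client. The client address is an assumption; the issue does not show the actual tooling used.

package main

import (
	"context"
	"log"

	lift "github.com/liftbridge-io/go-liftbridge/v2"
)

func main() {
	// Address is an assumption; point it at one of your Liftbridge servers.
	client, err := lift.Connect([]string{"localhost:9292"})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Create the stream with 2 partitions and a replication factor of 2,
	// matching the setup described above.
	err = client.CreateStream(context.Background(), "testtlblog", "testtlblog",
		lift.Partitions(2), lift.ReplicationFactor(2))
	if err != nil && err != lift.ErrStreamExists {
		log.Fatal(err)
	}
}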

I also found that massive NATS traffic was flowing between the Liftbridge servers; it looks like they are doing some replication work.

[#25689] Received on [testlb.testtlblog.0.replicate]: '�C� server-2���'
[#25690] Received on [_INBOX.FRbHzl6QRyqq2LNMOKbW1m.uHyUJJr7]: '�C!]��'
[#25691] Received on [testlb.notify.server-2]: '�C�
testtlblog'
[#25692] Received on [testlb.testtlblog.0.replicate]: '�C� server-2���'
[#25693] Received on [_INBOX.FRbHzl6QRyqq2LNMOKbW1m.vGjesjOE]: '�C!]��'
[#25694] Received on [testlb.notify.server-2]: '�C�
testtlblog'
[#25695] Received on [testlb.testtlblog.0.replicate]: '�C� server-2���'
[#25696] Received on [_INBOX.FRbHzl6QRyqq2LNMOKbW1m.crJLwdFT]: '�C!]��'
[#25697] Received on [testlb.notify.server-2]: '�C�
testtlblog'

I got the above information with the nats-sub tool. testlb is the clustering namespace, and testtlblog is the stream name.
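
For anyone who wants to watch the same traffic programmatically, here is a small sketch using the NATS Go client, equivalent to running nats-sub on a wildcard subject. The server address is an assumption.

package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to one of the NATS servers backing the cluster (address assumed).
	nc, err := nats.Connect("nats://10.10.10.19:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// testlb is the clustering namespace, so testlb.> matches the internal
	// Liftbridge subjects seen above, e.g. testlb.<stream>.<partition>.replicate
	// and testlb.notify.<server-id>.
	_, err = nc.Subscribe("testlb.>", func(m *nats.Msg) {
		fmt.Printf("Received on [%s]: %d bytes\n", m.Subject, len(m.Data))
	})
	if err != nil {
		log.Fatal(err)
	}

	select {} // block forever
}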

The messages in this log have already been cleaned up because the retention max age was reached, so how can this still be happening? I need your help and would appreciate it very much.

root /opt/liftbridge# ls -l data/server-2/streams/testtlblog/0/
total 8
-rw-r--r-- 1 root root 10485760 Jan 6 18:36 00000000000559806923.index
-rw-r--r-- 1 root root 0 Jan 6 18:36 00000000000559806923.log
-rw------- 1 root root 17 Jan 6 18:44 leader-epoch-checkpoint
-rw------- 1 root root 9 Jan 7 14:43 replication-offset-checkpoint

By the way, my Liftbridge server is v1.5.0 and my Go version is 1.15.5.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 21 (9 by maintainers)

Most upvoted comments

Reproducing takes more time (about 8 hours); see the attached screenshot. Steps (a rough Go sketch of them follows the list):

  1. Clean up the data folder
  2. Create 2 streams with 2 partitions each
  3. Subscribe to all 4 partitions
  4. Produce messages to one of the streams (I send 1 message per second)
  5. Wait about 2-8 hours
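
A minimal sketch of these steps with the Go client; the address, the stream names, and error handling are assumptions based on the description and logs above.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	lift "github.com/liftbridge-io/go-liftbridge/v2"
)

func main() {
	// Address is an assumption; point it at one of the Liftbridge servers.
	client, err := lift.Connect([]string{"localhost:9292"})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	ctx := context.Background()

	streams := []string{"test.stream1", "test.stream2"}

	// Step 2: create 2 streams with 2 partitions each, replicated.
	for _, name := range streams {
		err := client.CreateStream(ctx, name, name,
			lift.Partitions(2), lift.ReplicationFactor(2))
		if err != nil && err != lift.ErrStreamExists {
			log.Fatal(err)
		}
	}

	// Step 3: subscribe to all 4 partitions.
	for _, name := range streams {
		for p := int32(0); p < 2; p++ {
			stream := name
			err := client.Subscribe(ctx, stream, func(msg *lift.Message, err error) {
				if err == nil {
					fmt.Printf("%s: offset %d\n", stream, msg.Offset())
				}
			}, lift.Partition(p))
			if err != nil {
				log.Fatal(err)
			}
		}
	}

	// Step 4: produce 1 message per second to one of the streams, then
	// leave everything running and wait (step 5).
	for {
		if _, err := client.Publish(ctx, streams[0], []byte("tick")); err != nil {
			log.Println(err)
		}
		time.Sleep(time.Second)
	}
}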


liftbridge.yml

clustering:
    min.insync.replicas: 2
    raft.bootstrap.peers:
    - mstage-db
    - mstage-wrk
    - mstage-wrk2
    server.id: mstage-db
cursors:
    stream.auto.pause.time: 0
    stream.partitions: 1
data:
    dir: /var/lib/liftbridge
host: 10.10.10.19
logging:
    level: debug
    raft: true
nats.servers:
- nats://10.10.10.19:4222
- nats://10.10.10.20:4222
- nats://10.10.10.21:4222
streams:
    compact.enabled: true
    retention.max:
        age: 1h

Going to create a patch release that will include the fix shortly.

So far I’ve been unable to reproduce unfortunately. Will continue trying…

I got the following:

High CPU usage happens when LB spams messages to NATS (screenshot attached).

The code that generates the messages is the replicator's caughtUp check. When the segment is created we have:

Feb 24 20:05:47 mstage-wrk liftbridge[19222]: time="2021-02-24 20:05:47" level=debug msg="Check if we're caught up%!(EXTRA int64=-1, int64=-1)"

After creating the stream and publishing 1 message (Position: 144, LastOffset: 0):

Feb 24 20:05:49 mstage-wrk liftbridge[19222]: time="2021-02-24 20:05:49" level=debug msg="Check if we're caught up%!(EXTRA int64=0, int64=0)"
Feb 24 20:05:49 mstage-wrk liftbridge[19222]: time="2021-02-24 20:05:49" level=debug msg="[NotifyLEO] param leo: 0"
Feb 24 20:05:49 mstage-wrk liftbridge[19222]: time="2021-02-24 20:05:49" level=debug msg="[NotifyLEO] Segment params. Position: 144, LastOffset: 0, MaxBytes: 268435456"

After segment.MaxAge elapses, Liftbridge creates a new segment with LastOffset: -1:

Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg=checkAndPerformSplit
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="Appending new log segment for /var/lib/liftbridge/streams/test.stream2/0 with base offset 1"
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="Seal Active Segment"
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="[commitlog.Clean] Update the leader epoch offset cache to account for deleted segments"
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="[commitlog.Clean] leaderEpochCache.ClearEarliest with segment BaseOffset: %!(EXTRA int64=0)"
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="ClearEarliest Start%!(EXTRA int64=0, int64=0)"
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="End Clean%!(EXTRA <nil>)"
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="Check if we're caught up%!(EXTRA int64=0, int64=0)"
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="[NotifyLEO] param leo: 0"
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="[NotifyLEO] Segment params. Position: 0, LastOffset: -1, MaxBytes: 268435456"
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="Replicator Start GoRoutine%!(EXTRA <-chan struct {}=0xc0005c2fc0)"
Feb 24 20:11:38 mstage-wrk liftbridge[19222]: time="2021-02-24 20:11:38" level=debug msg="Replicator End GoRoutine%!(EXTRA <-chan struct {}=0xc0005c2fc0)"

It then calls r.partition.log.NotifyLEO(r, leo) with leo: 0, but the new segment was created with Position: 0, LastOffset: -1. This call reaches WaitForLEO, which returns from this block:

	if s.lastOffset != leo {
		ch := make(chan struct{})
		close(ch)
		return ch
	}

This repeats until a new message arrives. Then the segment's LastOffset is set to 0, the condition stops being met, and the function returns s.waitForData(waiter, s.position) instead.
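
To illustrate why this burns CPU, here is a simplified, self-contained sketch (not the actual Liftbridge code): the wait call hands back an already-closed channel whenever the empty segment's lastOffset (-1) differs from the leo passed in (0), so the receive returns immediately and the replicator loop never blocks.

package main

import "fmt"

func waitForLEO(lastOffset, leo int64) <-chan struct{} {
	if lastOffset != leo {
		ch := make(chan struct{})
		close(ch) // already closed: the caller will not block on it
		return ch
	}
	// Otherwise a channel that is closed only when new data arrives
	// would be returned (omitted in this sketch).
	return make(chan struct{})
}

func main() {
	lastOffset, leo := int64(-1), int64(0) // state right after the segment roll
	spins := 0
	for i := 0; i < 1000000; i++ { // in the server this loop has no bound
		<-waitForLEO(lastOffset, leo) // returns immediately, never blocks
		spins++
	}
	fmt.Println("looped", spins, "times without ever blocking")
}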

This situation occurs when retention is based only on time (with byte-based retention the probability is low).

I added some debug logging to the code and set segment.MaxAge to 5m to speed up reproduction. Config:

streams:
    compact.enabled: false
    retention.max:
        age: 24h

Sorry for the chaotic description. I can explain in more detail and attach logs if needed.
