pulsar: [Bug] Broker restart during reads/writes causes data loss
Search before asking
- I searched in the issues and found nothing similar.
Version
Affected Pulsar versions:
- 3.0.1
- 3.0.0
Tested environments
- GCP Kubernetes Cluster - 3-3-3-4 (ZK-proxy-broker-BK), Quorum 2-2-2
- MicroK8S local cluster - 1-1-3-1 (ZK-proxy-broker-BK), Quorum 1-1-1
Deployments done using official Helm Chart.
Minimal reproduce step
Check out https://github.com/websight-io/pulsar-chaos-test/ and follow README.MD
OR:
- Create partitioned (12 partitions) topic, set retention to -1, -1
- Create a producer and start writing to a topic
- Create a consumer with a new subscription set to Earliest position and start reading the messages
- Restart all Broker Pods at once
- Wait for the brokers to start (2-3 minutes)
- Stop producer created in step 2
- Stop consumer created in step 3
- Create a new consumer with a new subscription set to Earliest position and read all the messages
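For reference, the producer and consumer steps above can be approximated with the Python client (`pulsar-client`). This is only a sketch under assumptions, not the code from the linked repository: the `msg-<sequence>` payload format and all function names are made up for illustration. Any exception from `send` is treated as a failed send, and any exception from `receive` is treated as an idle timeout.

```python
import time
from typing import Set


def make_payload(seq: int) -> bytes:
    """Encode a sequence number so missing messages can be identified later."""
    return f"msg-{seq:010d}".encode("utf-8")


def parse_payload(data: bytes) -> int:
    """Recover the sequence number written by make_payload."""
    return int(data.decode("utf-8").split("-", 1)[1])


def produce(service_url: str, topic: str, count: int, delay_s: float = 0.01) -> Set[int]:
    """Send sequence-numbered messages; return the sequence numbers the broker acknowledged."""
    import pulsar  # pip install pulsar-client; imported lazily so the helpers above work without it

    client = pulsar.Client(service_url)
    acked: Set[int] = set()
    try:
        producer = client.create_producer(topic)
        for seq in range(count):
            try:
                producer.send(make_payload(seq))  # blocks until the broker acknowledges
                acked.add(seq)
            except Exception:
                # Send failed or timed out (e.g. while brokers restart);
                # the message is not counted as stored.
                pass
            time.sleep(delay_s)
    finally:
        client.close()
    return acked


def drain(service_url: str, topic: str, subscription: str, idle_timeout_ms: int = 10000) -> Set[int]:
    """Read from the Earliest position until no message arrives within idle_timeout_ms."""
    import pulsar

    client = pulsar.Client(service_url)
    seen: Set[int] = set()
    try:
        consumer = client.subscribe(
            topic,
            subscription,
            initial_position=pulsar.InitialPosition.Earliest,
        )
        while True:
            try:
                msg = consumer.receive(timeout_millis=idle_timeout_ms)
            except Exception:
                break  # simplification: treat any receive error as "backlog drained"
            seen.add(parse_payload(msg.data()))
            consumer.acknowledge(msg)
    finally:
        client.close()
    return seen
```

For example, `drain("pulsar://localhost:6650", "persistent://websight/dxp/chaos-test", "sub-after-restart")` would return the sequence numbers visible to a fresh subscription. The service URL and subscription name are placeholders.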
What did you expect to see?
Both consumers read the same number of messages. Topic should contain all the messages that were stored in it.
What did you see instead?
Consumer created after the restart cannot read the same number of messages as the one created before the restart. Topic does not contain all the messages that were stored in it.
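One way to make this comparison concrete, assuming each message carries a unique sequence number (an assumption; the original test may only count messages): anything the first consumer read that the second consumer cannot read has been lost from the topic. The function name below is hypothetical.

```python
from typing import Iterable, List


def lost_messages(read_before: Iterable[int], read_after: Iterable[int]) -> List[int]:
    """Return sequence numbers seen by the first consumer but not by the second.

    With infinite retention, the second consumer (new subscription, Earliest
    position) should see a superset of what the first consumer saw, so a
    non-empty result indicates data loss.
    """
    return sorted(set(read_before) - set(read_after))
```

For instance, `lost_messages([0, 1, 2, 3], [0, 1, 3])` returns `[2]`, while an empty list means the second consumer saw everything the first one did.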
Anything else?
- Retention is set before starting the test. It’s preserved and can be verified:
bin/pulsar-admin topics get-retention persistent://websight/dxp/chaos-test
{
"retentionTimeInMinutes" : -1,
"retentionSizeInMB" : -1
}
- Pulsar Manager shows that some of the partitions are empty after restart (or the data is not equally distributed)
- Pulsar Manager state from before the restart:
Are you willing to submit a PR?
- I’m willing to submit a PR!
About this issue
- State: closed
- Created a year ago
- Comments: 27 (10 by maintainers)
@michalcukierman I have found one potential case that may cause the topic retention policy to not work. I will push a fix later.
I changed the repository visibility to public:
Here are the values I am using:
https://github.com/websight-io/pulsar-chaos-test-perf/blob/main/values.yaml
You can find two scripts there I use to reproduce the error.
I will test the settings you mentioned and the namespace level policies tomorrow.
I will also perform the test on Pulsar 2.10.4