pulsar: [Bug] Broker restart during reads/writes causes data loss

Search before asking

  • I searched in the issues and found nothing similar.

Version

  • Affected Pulsar versions:

    • 3.0.1
    • 3.0.0
  • Tested environments

    • GCP Kubernetes Cluster - 3-3-3-4 (Zk-proxy-broker-BK), Quorum 2-2-2
    • MicroK8S local cluster - 1-1-3-1 (Zk-proxy-broker-BK), Quorum 1-1-1

Deployments done using official Helm Chart.

Minimal reproduce step

Check out https://github.com/websight-io/pulsar-chaos-test/ and follow README.MD

OR:

  1. Create a partitioned topic (12 partitions) and set its retention to -1, -1
  2. Create a producer and start writing to the topic
  3. Create a consumer with a new subscription set to Earliest position and start reading the messages
  4. Restart all Broker Pods at once
  5. Wait for the brokers to start (2-3 minutes)
  6. Stop producer created in step 2
  7. Stop consumer created in step 3
  8. Create a new consumer with a new subscription set to Earliest position and read all the messages
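Steps 3 and 8 both come down to reading the topic from the Earliest position and counting what arrives. A minimal sketch of such a counter follows; it is not taken from the linked repository. The function name `drain_counts`, its parameters, and the "stop after an idle timeout" rule are assumptions made for illustration. The `consumer` argument is any object that offers `receive(timeout_millis=...)` and `acknowledge(msg)` and whose messages offer `topic_name()`, which is how the Python Pulsar client's consumer is expected to behave.

```python
from collections import Counter


def drain_counts(consumer, idle_timeout_ms=10_000, timeout_error=TimeoutError):
    """Read until no message arrives within idle_timeout_ms.

    Returns a Counter mapping each topic (partition) name to the number
    of messages received from it. All names here are hypothetical.
    """
    counts = Counter()
    while True:
        try:
            msg = consumer.receive(timeout_millis=idle_timeout_ms)
        except timeout_error:
            # Nothing arrived within the idle window: treat the topic as drained.
            break
        consumer.acknowledge(msg)
        counts[msg.topic_name()] += 1
    return counts
```

With the real Python client, the exception raised on a receive timeout would need to be passed as `timeout_error` (likely `pulsar.Timeout`, depending on the client version). Counting per partition rather than in total makes it easier to see whether specific partitions came back empty.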

What did you expect to see?

Both consumers read the same number of messages. Topic should contain all the messages that were stored in it.

What did you see instead?

The consumer created after the restart cannot read the same number of messages as the one created before the restart. The topic does not contain all the messages that were stored in it.
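The mismatch between the two consumers can be made concrete by comparing per-partition counts from before and after the restart. The helper below is a sketch only: `missing_per_partition` and the sample partition names and numbers are hypothetical and do not come from the issue.

```python
def missing_per_partition(before, after):
    """Return {partition: shortfall} for partitions that lost messages.

    `before` and `after` map partition names to message counts read by
    the first and second consumer respectively.
    """
    missing = {}
    for partition, expected in before.items():
        shortfall = expected - after.get(partition, 0)
        if shortfall > 0:
            missing[partition] = shortfall
    return missing


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the issue.
    before = {"chaos-test-partition-0": 1000, "chaos-test-partition-1": 1000}
    after = {"chaos-test-partition-0": 1000, "chaos-test-partition-1": 0}
    print(missing_per_partition(before, after))
```

An empty result means the second consumer read at least as many messages from every partition as the first one did.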

Anything else?

  • Retention is set before starting the test. It’s preserved and can be verified: bin/pulsar-admin topics get-retention persistent://websight/dxp/chaos-test
{
  "retentionTimeInMinutes" : -1,
  "retentionSizeInMB" : -1
}
  • Pulsar Manager shows that some of the partitions are empty after the restart (or the data is not equally distributed). [Screenshot 2023-08-09 at 18 40 31]
  • Pulsar Manager state from before the restart: [Screenshot 2023-08-09 at 18 41 28]
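The retention check from the first bullet can be scripted so it runs before and after each restart. The sketch below assumes the JSON printed by `get-retention` has the shape shown above; the function name `retention_is_infinite` is hypothetical.

```python
import json


def retention_is_infinite(get_retention_output):
    """Return True if both retention limits in the JSON text are -1."""
    policy = json.loads(get_retention_output)
    return (
        policy.get("retentionTimeInMinutes") == -1
        and policy.get("retentionSizeInMB") == -1
    )


if __name__ == "__main__":
    sample = '{ "retentionTimeInMinutes" : -1, "retentionSizeInMB" : -1 }'
    print(retention_is_infinite(sample))
```

A passing check only confirms the policy that the admin API reports; it does not by itself prove that the broker applied that policy when it trimmed ledgers.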

Are you willing to submit a PR?

  • I’m willing to submit a PR!

About this issue

  • State: closed
  • Created a year ago
  • Comments: 27 (10 by maintainers)

Most upvoted comments

I haven’t noticed any warnings or errors. Please check out: https://github.com/websight-io/pulsar-chaos-test-perf/

You should be able to recreate the issue by running a single shell script. I also think that the screenshot mentioned here may show the sequence of deleting a ledger and applying the policy: #20968 (comment)

Please search for “Policies updated successfully”

@michalcukierman I have found one potential case that may cause the topic retention policy to not work. I will push a fix later.

I changed the repository visibility to public:

Here are the values I am using:

https://github.com/websight-io/pulsar-chaos-test-perf/blob/main/values.yaml

You can find two scripts there I use to reproduce the error.

I will test the settings you mentioned and the namespace level policies tomorrow.

I will also perform the test on Pulsar 2.10.4.