kafka-backup: Yet another NPE

I hit the NPE below yesterday (using commit 3c95089c). I tried again today with the latest commit from master (f30b9ad9), but it is still there. The output below is from the latest version.

What changed: I migrated to eCryptfs. I stopped kafka-backup, renamed the target dir, then emptied and chattr +i’d the backup sink config (to prevent Puppet from starting kafka-backup again). Then I deployed the eCryptfs changes, rsynced the data back, removed the chattr +i, and reapplied Puppet.
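For reference, the sequence looked roughly like the sketch below; the service name, directories and config path are illustrative placeholders, not the real ones:

    # Rough outline of the migration steps above; unit name, paths and config file are placeholders.
    systemctl stop kafka-backup
    mv /var/lib/kafka-backup /var/lib/kafka-backup.pre-ecryptfs    # rename the target dir
    : > /etc/kafka-backup/backup-sink.properties                   # empty the sink config
    chattr +i /etc/kafka-backup/backup-sink.properties             # so Puppet cannot rewrite it and restart kafka-backup
    # ... deploy the eCryptfs changes and mount the encrypted target dir ...
    rsync -a /var/lib/kafka-backup.pre-ecryptfs/ /var/lib/kafka-backup/
    chattr -i /etc/kafka-backup/backup-sink.properties
    puppet agent -t                                                # reapply Puppet, which brings kafka-backup back up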

Now, the main question: should we try to debug this at all, or should I just wipe it and take another fresh backup? This is QA, so we have some time if needed.

[2020-03-17 02:23:47,321] INFO [Consumer clientId=connector-consumer-chrono_qa-backup-sink-0, groupId=connect-chrono_qa-backup-sink] Setting offset for partition [redacted].chrono-billable-datasink-0 to the committed offset FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=kafka5.node:9093 (id: 5 rack: null), epoch=187}} (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:762)
[2020-03-17 02:23:47,697] ERROR WorkerSinkTask{id=chrono_qa-backup-sink-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:179)
java.lang.NullPointerException
        at de.azapps.kafkabackup.sink.BackupSinkTask.close(BackupSinkTask.java:122)
        at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:397)
        at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:591)
        at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:196)
        at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177)
        at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
[2020-03-17 02:23:47,705] ERROR WorkerSinkTask{id=chrono_qa-backup-sink-0} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:180)
[2020-03-17 02:23:47,705] INFO Stopped BackupSinkTask (de.azapps.kafkabackup.sink.BackupSinkTask:139)
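Since the log says the task will not recover until manually restarted: assuming the Connect worker’s REST listener is on the default port 8083, the failed task can be restarted without bouncing the whole worker, e.g.:

    # Connector name is taken from the log, port is the Connect default; adjust as needed.
    curl http://localhost:8083/connectors/chrono_qa-backup-sink/status
    curl -X POST http://localhost:8083/connectors/chrono_qa-backup-sink/tasks/0/restart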


Most upvoted comments

I have seen it in my test setup today too… So far I am failing to reproduce it; I will try again over the next few days…

It happened on another cluster today too… I have an Azure backup cronjob that stops kafka-backup, unmounts eCryptfs, runs azcopy sync, then mounts eCryptfs again and starts kafka-backup. Tonight the umount step failed, so the script failed too (set -e). I guess that is when the issue happens, though I need to recheck the timeline carefully. I will update this issue later.
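The job is shaped roughly like the sketch below (paths, mount points and the storage URL are placeholders); with set -e, the failed umount aborts the script partway through, so none of the remaining steps ran that night:

    #!/bin/bash
    # Hypothetical outline of the nightly Azure backup job; paths and destination URL are placeholders.
    set -e
    systemctl stop kafka-backup
    umount /var/lib/kafka-backup          # this step failed tonight, so nothing below ran
    azcopy sync /var/lib/kafka-backup-enc "https://<account>.blob.core.windows.net/<container>"
    mount /var/lib/kafka-backup           # remount eCryptfs (e.g. via fstab entry)
    systemctl start kafka-backup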

UPD: I checked the logs just now, and the NPE actually happened earlier. kafka-backup was killed by the OOM killer multiple times… It seems -Xmx1024M with a Docker memory_limit of 1152M is not enough for this cluster 😦 Any ideas on how to calculate the heap/RAM size for kafka-backup?
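I don’t know of an exact formula, but a rough lower bound for the consumer side alone is max.partition.fetch.bytes (1 MB by default) per assigned partition, plus Connect and sink overhead, so the heap grows with the number of partitions being backed up. A hedged sketch of raising both limits together (whether KAFKA_HEAP_OPTS is picked up depends on how the container starts the JVM; the container name and sizes are placeholders, not recommendations):

    # Assumption: the launcher honors KAFKA_HEAP_OPTS the way Kafka's standard run scripts do.
    export KAFKA_HEAP_OPTS="-Xmx2048M -Xms2048M"
    # Keep the container limit well above the heap to leave room for off-heap
    # buffers, metaspace and thread stacks (the ratio is a rule of thumb, not a formula).
    docker update --memory=3g --memory-swap=3g kafka-backup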

Do you want me to do some debugging on this data? I cannot upload it, as it contains sensitive company data…