alluxio: Formatted master incorrectly accepts blocks created before the format due to blockID(containerID) clash


Describe the bug The Alluxio master generates block container IDs starting from 0. If the master is formatted and restarted without formatting the workers, an old worker will report its existing blocks to the new master, and the master may accept those blocks by accident.

To Reproduce Introduced in https://github.com/Alluxio/alluxio/pull/14006#issuecomment-913334686

The new master has generated some blocks with container IDs starting from 0. Then the worker from the previous cluster registers with old blocks (created with the old master, with container ID starting from 0). If you are unlucky, a block on this old worker may have old blockID 0 and the new blockID 0 has been allocated to a totally irrelevant file. The new master, in this case, will mistakenly think this blockID is recognized and accepts this copy from the old worker.

If the block lengths do not match (the new block 0 vs. the old block 0), the master throws the error referenced in the link above.

What’s worse, if the block lengths DO match, the master will think this copy belongs to this totally irrelevant file. Then the Alluxio data is messed up without you noticing!
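
The clash described above can be reproduced in miniature. The sketch below is not Alluxio code: `ToyMaster`, its methods, and the assumed layout of a block ID (container ID in the high bits, a 24-bit sequence number in the low bits) are made up for illustration. It shows a freshly formatted master handing out container ID 0 again and then attributing a stale block report to an unrelated file.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of a master's block bookkeeping, not the real Alluxio master.
class ToyMaster {
  // Assumed layout: high bits hold the container ID, low 24 bits the sequence number.
  static final int SEQUENCE_NUMBER_BITS = 24;

  private long nextContainerId = 0; // a formatted master starts again from 0
  private final Map<Long, String> blockToFile = new HashMap<>();
  private final Map<Long, Long> blockLength = new HashMap<>();

  static long blockId(long containerId, long sequenceNumber) {
    return (containerId << SEQUENCE_NUMBER_BITS) | sequenceNumber;
  }

  // Allocates the first block of a new file and records its expected length.
  long createFile(String path, long length) {
    long id = blockId(nextContainerId++, 0);
    blockToFile.put(id, path);
    blockLength.put(id, length);
    return id;
  }

  // Returns the file this master attributes the reported block to,
  // or null if the block ID is unknown. Throws if the ID is known but the length differs.
  String acceptReport(long id, long reportedLength) {
    if (!blockToFile.containsKey(id)) {
      return null;
    }
    long expected = blockLength.get(id);
    if (expected != reportedLength) {
      throw new IllegalStateException(
          "Block " + id + ": expected length " + expected + ", worker reported " + reportedLength);
    }
    return blockToFile.get(id);
  }
}

class StaleBlockDemo {
  static String run() {
    ToyMaster oldMaster = new ToyMaster();
    long oldBlock = oldMaster.createFile("/old/data", 64); // block lives on the old worker

    ToyMaster newMaster = new ToyMaster();                 // master formatted, worker not
    newMaster.createFile("/new/unrelated", 64);            // reuses container ID 0

    // The old worker registers and reports its old block to the new master.
    return newMaster.acceptReport(oldBlock, 64);
  }

  public static void main(String[] args) {
    System.out.println(run()); // prints "/new/unrelated"
  }
}
```

With equal lengths the stale block is silently attached to `/new/unrelated`; with different lengths the toy master throws, which corresponds to the length-mismatch case above.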


Urgency HIGH


About this issue

  • State: open
  • Created 3 years ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

@HelloHorizon @ZhuTopher The design doc has passed review and #14258 is in progress.

@HelloHorizon I re-reviewed the design doc on Dec. 14, 2021. Not sure if the doc has been changed since then or not. I believe Jiacheng wanted to take another look at it still?

I haven’t re-reviewed the corresponding PR #14258 since my initial pass, but Jiacheng has been making progress with change requests there.

@jiacheliu3 I just want to add one benefit of changing the path scheme to include clusterId: it would be a must if we’d like to support running workers against multiple Alluxio clusters. Other than that, I believe this is much preferable to starting block IDs from arbitrary numbers. With this protocol, we’d also spare the Alluxio master from a flood of large registrations after a format.

Agreed, including clusterId in the design will help support workers that are owned by multiple clusters.
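
To make the clusterId idea concrete, here is a minimal sketch. It is an assumption about how such a check could look, not the design in the doc or in #14258: the class `ClusterIdCheck`, the `Decision` enum, and both methods are hypothetical. The master gets a new ID when it is formatted, a worker remembers the ID of the cluster it last registered with, and block files live under a directory named after that ID.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;

// Hypothetical sketch of a clusterId check; none of these names come from Alluxio.
class ClusterIdCheck {
  enum Decision { ACCEPT, REJECT_STALE_WORKER }

  // Assumption: a new ID is generated once, each time the master is formatted.
  static String newClusterId() {
    return UUID.randomUUID().toString();
  }

  // A worker with no stored ID is treated as fresh; a worker carrying a
  // different ID belongs to a previous (or another) cluster and is rejected.
  static Decision onRegister(String masterClusterId, String workerClusterId) {
    if (workerClusterId == null || workerClusterId.equals(masterClusterId)) {
      return Decision.ACCEPT;
    }
    return Decision.REJECT_STALE_WORKER;
  }

  // Assumed path scheme: <tier root>/<clusterId>/<blockId>, so blocks written
  // for different clusters never share a directory.
  static Path blockPath(String tierRoot, String clusterId, long blockId) {
    return Paths.get(tierRoot, clusterId, Long.toString(blockId));
  }
}
```

Under this scheme a worker left over from before the format would be turned away at registration, so its blocks are never compared against the new master's block IDs, and the master does not have to process a large report of stale blocks.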

On your second point, if you mean starting the master with a manually specified container ID, it is really hard for the user to know which container ID to use. It would be better to replace that with an option, given when starting the worker, to wipe out all tiered storage before the worker process starts, for example by embedding it into alluxio-workers.sh. The two approaches are equally manual.

What I mean is: let’s not start block container IDs from System.currentTimeMillis().