runtime: Incorrect behavior when TieredCompilation is enabled

Some Orleans users have hit an issue when running on .NET Core 2.1 with TieredCompilation enabled. After investigation, I believe the issue is a JIT bug related to TieredCompilation.

Repro gist: https://gist.github.com/ReubenBond/b5bdf98423d5279a8e9ae63a5e6e8f0a

In the repro, the line return buff[offset]; will return buff[0] when TieredCompilation is enabled and buff[1] (correct) when TieredCompilation is disabled.

byte[] CheckLength(out int offset)
{
    // In the original code, this logic is more complicated.
    // It's simplified here to demonstrate the bug.
    offset = 1;
    return currentBuffer;
}

public byte ReadByte()
{
    int offset;
    var buff = CheckLength(out offset);
    return buff[offset];
}

We have advised the affected users to disable TieredCompilation for now, as a workaround.

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 5
  • Comments: 22 (16 by maintainers)

Commits related to this issue

Most upvoted comments

I understand the workaround may be impractical – I looked over the sources but couldn’t tell if the pattern was pervasive or just happened in a few places. The bug requires that exact sequence so I was hopeful it might be something that could be worked around easily.

Telling your users to not enable (2.1) or to disable (2.2, 3.0) tiered compilation for now seems prudent. We should have this fixed shortly in 3.0 and I will propose fixing it in both 2.2 and 2.1.

@kouvel @noahfalk @dotnet/jit-contrib

@sergeybykov Tiered compilation is not enabled by default for 2.1, only for 2.2 and the master branch, specifically to help flush out issues like this that we haven’t encountered in internal testing.

The fix is in master now so it would be great if you all can confirm the fix with an upcoming 3.0 preview build. I believe it takes a day or two for the previews to pick up these changes. I can point you at the right build once this happens.

And then fgSetTreeSeqHelper double-checks that GTF_REVERSE_OPS is set.

Anyways undoing both those fixes the bug and doesn’t cause any pri1 failures and modest minopts diffs (1147 out of ~192K methods). Will PR this fix and the test case in the morning…