runtime: `Monitor.TryEnter` sometimes fails even if `Monitor.IsEntered` returns `true`

Description

In a large code base, we have a snippet like this:

object myMutex = new object();

void Foo()
{
    if (Monitor.IsEntered(myMutex))
    {
        if (!Monitor.TryEnter(myMutex))
            LogError("Fail"); // #1
    }
}

Theortically, line #1 should never be visited, right? But we observe that it actually gets visited sometimes.

Initially, we weren’t checking Monitor.TryEnter inside of Monitor.IsEntered, of course, and this code was added for diagnostic purposes after we’ve noticed deadlocks related to this piece of code: the thread holding the monitor was receiving false from Monitor.TryEnter, despite it should theoretically always succeed and return true.

Reproduction Steps

This was initially reported by the user of our product, and we were able to track it down to the described code.

We are occasionally able to trigger the code to enter this line locally, during a debug session, but so far were unable to prepare an isolated MCVE. We are able to share more code if this helps.

Expected behavior

Monitor.TryEnter should always succeed if the calling thread is holding the lock (i.e. if Monitor.IsEntered returns true).

Actual behavior

Monitor.TryEnter occasionally returns false despite the calling thread holding the lock (i.e. if Monitor.IsEntered returns true).

Regression?

No response

Known Workarounds

Replace a call to Monitor.TryEnter(myMutex) with the following block:

if (Monitor.IsEntered(myMutex))
  Monitor.Enter(myMutex);

Surprisingly, this works so far.

Configuration

  • .NET version: .NET 6.0.8
  • OS: Windows 10
  • Architecture: x64

Other information

No response

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 39 (22 by maintainers)

Commits related to this issue

Most upvoted comments

@Mr0N The enters/exits have to be paired. The general concept is known as recursive mutex. If you enter it twice within the same thread then it just increments an internal counter since you know it’s already locked.

As for the actual product bug… .NET / CoreCLR uses an optimization called “thin locks”. Under specific set of circumstances (eg. no previous call to GetHashCode) there is a fast code path for Monitor.Enter/TryEnter/Exit as long as there is no contention on the lock. It simply updates flags in the object header to record the state of the lock. Once a contention happens though (ie. one thread has already acquired the lock through Monitor.Enter and another thread tries to acquire the same lock through Monitor.Enter) the “thin lock” optimization is abandoned and it transitions into a new state where the lock is tracked through indirect entry in a “sync table”. Under normal circumstances this optimization should be entirely unobservable. Unfortunately it was not…

Consider the following scenario:

  • Thread 1 enters the lock through Monitor.Enter, the lock is still represented as thin lock
  • Thread 2 tries to acquire the same lock. It realizes that it is acquired and represented as “thin lock”. It starts the transition from “thin lock” into “full lock” representation.
  • Thread 1 calls Monitor.TryEnter precisely at the same moment as Thread 2 is in middle of the transition. Instead of waiting for the transition to complete (in thread 2) it incorrectly bails out and returns false

Why does calling GetHashCode make a difference you may ask? Because the object header is used by multiple different optimizations and they are mutually exclusive. It either stores the thin lock, cached hash code, or an indirect pointer to the “sync table” (which can hold a full lock, hash code, and several other things).

That’s indeed really weird. I can confirm that the offending thread entered the if (!Monitor.TryEnter(myMutex)) block despite owning the lock:

0:053> !syncblk
Index         SyncBlock MonitorHeld Recursion Owning Thread Info          SyncBlock Owner
  814 000001FB2E3F4B68            5         1 000001FB2FD7A190 1d38  53   000001faa154b288 System.Object

I saw nothing wrong in the disassembly, so that sounds like an issue in Monitor.TryEnter. There are two other threads competing for the lock so I wonder if that’s part of the story 🤔

Thanks for the dump, from that it looks like the sync block has transitioned already and based on the state of it at the time of the dump, I didn’t see any way for TryEnter to return false. So likely it’s a race during transition.

Do you mean the transition from thin-lock to syncblock? Based on that assumption I managed to repro the issue:

class Program
{
    private static volatile object _mutex;

    static async Task Main()
    {
        _mutex = new();

        for (int i = 0; i < 8; i++)
        {
            var t = new Thread(static () =>
            {
                while (true)
                {
                    var m = _mutex;

                    Monitor.Enter(m);

                    if (!Monitor.TryEnter(m))
                    {
                        Debugger.Break();
                    }

                    Monitor.Exit(m);
                    Monitor.Exit(m);

                    _mutex = new();
                }
            });

            t.Start();
        }

        Console.WriteLine("Running");
        Console.ReadLine();
}

I was not able to reproduce the issue without recreating a new object every time (so very likely an issue during the promotion of the thin lock).

Thanks for the dump, from that it looks like the sync block has transitioned already and based on the state of it at the time of the dump, I didn’t see any way for TryEnter to return false. So likely it’s a race during transition.

https://github.com/dotnet/runtime/blob/215b39abf947da7a40b0cb137eab4bceb24ad3e3/src/coreclr/System.Private.CoreLib/src/System/Threading/Monitor.CoreCLR.cs#L84 Maybe the problem is that the object is somehow unavailable for 0 nanoseconds, for example, the second thread is reading it

You’re describing a re-entrant lock. They’re almost always a bad idea, and you should attempt to adjust things so that you don’t attempt to re-take the lock.

This is a very strong statement. I don’t see what is wrong with using the Monitor API as a reentrant lock if it advertises itself as such.

It’s a blocking API, so it’s going to wait until it can take the lock, whereas TryEnter doesn’t. You may be opening yourself up to deadlocks at some point, though.

Why would Enter block or cause a deadlock on the thread that is already holding the lock?

All this aside, my initial guess is that you’re using Monitor in some async code.

No, the code is not async.

If you manage to capture a memory dump when that happens during a debug session, it could help understanding what’s happening.

So we need to collect a process dump when the process gets stopped on the breakpoint at line #1, is that correct?

@Mr0N Everybody would really appreciate if you stopped spamming this issue. @kevingosse @kouvel Thanks a lot for the quick investigation and the fix!

@ForNeVeR Just a thought: as an alternative workaround, you can call GetHashCode on the objects used for locking after allocating them:

    public CppPriorityMutex(CppPriorityMutex.TagT tag)
    {
      myMutex = new object();
      _ = myMutex.GetHashCode();
      myPrioritisedMutex = new object();
      _ = myPrioritisedMutex.GetHashCode();
      myLockRequestPriority = CppPriorityMutex.Priority.Low;
    }

This will force the promotion to a syncblock as soon as a thread tries locking those objects (instead of waiting for the first moment the lock is contended). In my tests this seems to fix the race condition.

We are occasionally able to trigger the code to enter this line locally, during a debug session, but so far were unable to prepare an isolated MCVE. We are able to share more code if this helps.

If you manage to capture a memory dump when that happens during a debug session, it could help understanding what’s happening.

Reminds me that issue I faced with frozen segments recently - GC could collect a sync block from a live object and potentially allocate a new syncblock on that place for a different object resulting that live object started pointing to a different sync object. Just in case if it potentially can be a similar issue.

It’s not a problem, it’s just a volatile field

That’s completely irrelevant. You can use Interlocked operations instead if that makes you feel better:

class Program
{
    private static object _mutex;

    static async Task Main()
    {
        Interlocked.CompareExchange(ref _mutex, new object(), null);
        for (int i = 0; i < 32; i++)
        {
            var t = new Thread(static () =>
            {
                while (true)
                {
                    var m = Interlocked.CompareExchange(ref _mutex, null, null);

                    Monitor.Enter(m);

                    if (!Monitor.TryEnter(m))
                    {
                        Debugger.Break();
                    }

                    Monitor.Exit(m);
                    Monitor.Exit(m);

                    _mutex = Interlocked.CompareExchange(ref _mutex, new object(), m);
                }
            });

            t.Start();
        }

        Console.WriteLine("Running");
        Console.ReadLine();
}

Thanks for the dump, from that it looks like the sync block has transitioned already and based on the state of it at the time of the dump, I didn’t see any way for TryEnter to return false. So likely it’s a race during transition.

Do you mean the transition from thin-lock to syncblock? Based on that assumption I managed to repro the issue:

class Program
{
    private static volatile object _mutex;

    static async Task Main()
    {
        _mutex = new();

        for (int i = 0; i < 8; i++)
        {
            var t = new Thread(static () =>
            {
                while (true)
                {
                    var m = _mutex;

                    Monitor.Enter(m);

                    if (!Monitor.TryEnter(m))
                    {
                        Debugger.Break();
                    }

                    Monitor.Exit(m);
                    Monitor.Exit(m);

                    _mutex = new();
                }
            });

            t.Start();
        }

        Console.WriteLine("Running");
        Console.ReadLine();
}

I was not able to reproduce the issue without recreating a new object every time (so very likely an issue during the promotion of the thin lock).

It’s not a problem, it’s just a volatile field

Thanks for the dump, from that it looks like the sync block has transitioned already and based on the state of it at the time of the dump, I didn’t see any way for TryEnter to return false. So likely it’s a race during transition.

@Mr0N Monitor.IsEntered is documented to return true if the current thread is holding the lock.