pythonnet: Random Access Violations on shutdown
Environment
- Pythonnet version: 3.0.0.post1
- Python version: 3.8.10
- Operating System: Windows 11
- .NET Runtime: .NET Framework 472
Details
-
Describe what you were trying to get done.
I am getting random access violations during shutdown in pythonnet. The issue can be re-produced with a C# function as simple as
public static int Add(int a, int b) => a + b;
I get the access violation; however, it is not deterministic at all. Some times, it succeeds as expected, other times I get this. Note that I am upgrading from pythonnet 2.5.2 to pythonnet 3.0.0.post1 so this code used to work. All I am doing is invoking this code from python. I am running this python code that uses pythonnet in pytest. I haven’t seen this when we were using pythonnet 2.5.2.
- If there was a crash, please include the traceback here.
Unhandled Exception: System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at Python.Runtime.Runtime.PyObject_TYPE(BorrowedReference op)
at Python.Runtime.Runtime.NullGCHandles(IEnumerable`1 objects)
at Python.Runtime.Runtime.TryCollectingGarbage(Int32 runs, Boolean forceBreakLoops)
at Python.Runtime.Runtime.Shutdown()
at Python.Runtime.PythonEngine.Shutdown()
at Python.Runtime.Loader.Shutdown(IntPtr data, Int32 size)
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 35 (11 by maintainers)
Yes, please add an option to disable the shutdown code completely. I have had cases where simply
Ctrl+C
ing the process caused a .NET stack trace to be printed, which should not be the case. I have tried monkey-patching thepythonnet.unload
function, but it made everything worse, that is, even more annoying stack traces printed to the console.Would this be an appropriate fix?
The fix is simple: remove a possible reference to the currently disposed object from the
reflectedObjects
set. If the item does not exist the Remove is a no-op.I am not entirely sure why a variable
copyForException
is not used in the catch block.This is a use-after-free; and switching the allocator only changes the likelyhood of the problem resulting in a crash.
I’ve debugged the case with
-Xdev
a bit, and what’s happening is that a Python object is getting freed because its last reference is released by:Due to the use of a debug allocator, this overwrites the ob_type field of the freed object with a
0xdddddddd
bit pattern.After that,
NullGCHandles(CLRObject.reflectedObjects);
is called. But the reflectedObjects still contain a pointer to the deleted PyObject, soPyObject_TYPE
is a use-after-free and retrieves the invalid PyTypeObject* 0xdddddddd. Attempting to dereference that to get at thetp_clr_inst
field results in the crash:With the default allocator (without -Xdev; i.e. pymalloc), I’ve also observed cases where
PyObject_TYPE
crashes directly, but those are more tricky to reproduce. That likely happens because the freed object was the last in its arena, so pymalloc returned the memory to the system. But it is likely that the same bug is the cause for both.Switching to PYTHONMALLOC=malloc just means the memory is more likely to stay around long enough to avoid the crash, making the use-after-free somewhat harmless. But with the default pymalloc allocator, it’s likely to cause real crashes; and with the debug allocator it’s basically guaranteed to crash.
@filmor Sure thing. The C# code I am calling from Python is distributed in a DLL shipped with a commercial product, so I am afraid I cannot include it in my report. My Python code, on the other hand, is very simple and just calls the C# code as instructed by the software vendor. Is there anything else (an output, some profiling results, etc.) I could provide you with that would be useful?
@filmor I was wondering if there’s any update on this issue or if there’s anything we can do to work around this issue. Would you suggest we force using the default allocator?
@filmor Ok I’ve realized that settings PYTHONMALLOC=malloc made the error disappear? So, maybe should be used this instead of default?