runtime: rhel8 arm64 throws NullReferenceExceptions

In our CI builds, each run on RHEL8 arm64 shows NullReferenceExceptions in the log.

On the same arm64 host with a Fedora 32 VM there are no NullReferenceExceptions. When I build and test on another RHEL8 arm64 machine, NullReferenceExceptions also show up in unexpected places.

Some example stack traces from the CI log:

Microsoft.Extensions.Hosting tests

        System.NullReferenceException : Object reference not set to an instance of an object.
        Stack Trace:
          /home/tester/runtime/src/coreclr/src/System.Private.CoreLib/src/System/Array.CoreCLR.cs(521,0): at System.SZArrayHelper.GetEnumerator[T]()
          /home/tester/runtime/src/libraries/System.Linq/src/System/Linq/Single.cs(136,0): at System.Linq.Enumerable.SingleOrDefault[TSource](IEnumerable`1 source, Func`2 predicate)
             at System.Reflection.NetCoreReflectionExtensions.GetConstructor(Type type, BindingFlags bindingAttr, Object binder, Type[] types, Object[] modifiers)
             at Castle.DynamicProxy.Generators.InterfaceProxyWithTargetGenerator.EnsureValidBaseType(Type type)
             at Castle.DynamicProxy.Generators.InterfaceProxyWithTargetGenerator.GenerateCode(Type proxyTargetType, Type[] interfaces, ProxyGenerationOptions options)
             at Castle.DynamicProxy.DefaultProxyBuilder.CreateInterfaceProxyTypeWithoutTarget(Type interfaceToProxy, Type[] additionalInterfacesToProxy, ProxyGenerationOptions options)
             at Castle.DynamicProxy.ProxyGenerator.CreateInterfaceProxyTypeWithoutTarget(Type interfaceToProxy, Type[] additionalInterfacesToProxy, ProxyGenerationOptions options)
             at Castle.DynamicProxy.ProxyGenerator.CreateInterfaceProxyWithoutTarget(Type interfaceToProxy, Type[] additionalInterfacesToProxy, ProxyGenerationOptions options, IInterceptor[] interceptors)
             at Moq.CastleProxyFactory.CreateProxy(Type mockType, IInterceptor interceptor, Type[] interfaces, Object[] arguments)
             at Moq.Mock`1.InitializeInstance()
             at Moq.Mock`1.OnGetObject()
             at Moq.Mock.get_Object()
             at Moq.Mock`1.get_Object()
          /home/tester/runtime/src/libraries/Microsoft.Extensions.Hosting/tests/UnitTests/Internal/HostTests.cs(583,0): at Microsoft.Extensions.Hosting.Internal.HostTests.<>c__DisplayClass22_0.<HostStopAsyncCanBeCancelledEarly>b__3(IServiceCollection services)
          /home/tester/runtime/src/libraries/Microsoft.Extensions.Hosting/src/HostingHostBuilderExtensions.cs(121,0): at Microsoft.Extensions.Hosting.HostingHostBuilderExtensions.<>c__DisplayClass7_0.<ConfigureServices>b__0(HostBuilderContext context, 

System.Linq.Parallel.Tests

        System.NullReferenceException : Object reference not set to an instance of an object.
        Stack Trace:
          /home/tester/runtime/src/libraries/System.Linq.Parallel/src/System/Linq/Parallel/Enumerables/ParallelQuery.cs(104,0): at System.Linq.ParallelQuery`1.Cast[TCastTo]()
          /home/tester/runtime/src/libraries/System.Linq.Parallel/src/System/Linq/ParallelEnumerable.cs(5271,0): at System.Linq.ParallelEnumerable.Cast[TResult](ParallelQuery source)
          /home/tester/runtime/src/libraries/System.Linq.Parallel/tests/QueryOperators/CastTests.cs(105,0): at System.Linq.Parallel.Tests.CastTests.Cast_Empty(Labeled`1 labeled, Int32 count)

System.Text.Json.Serialization.Tests

        System.NullReferenceException : Object reference not set to an instance of an object.
        Stack Trace:
          /home/tester/runtime/src/libraries/System.Text.Json/tests/Serialization/SerializationWrapper.cs(104,0): at System.Text.Json.Serialization.Tests.SerializationWrapper.WriterSerializerWrapper.SerializeWrapper[T](T value, JsonSerializerOptions options)
          /home/tester/runtime/src/libraries/System.Text.Json/tests/Serialization/PolymorphicTests.cs(125,0): at System.Text.Json.Serialization.Tests.PolymorphicTests.ArrayAsRootObject()
          --- End of stack trace from previous location ---

@janvorli I don’t know how to debug this. Can you take a look or give me some pointers?

cc @omajid

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 61 (60 by maintainers)

Most upvoted comments

Is this something we should investigate further?

I’ve just hit that on Apple Silicon arm64 too, so it doesn’t seem to be related to RHEL 8. It might be something new on arm64 in general. I’ll see if it repros on my Odroid N2.

Does your RHEL 8 installation have page size different than 4kB?

Yes, it has 64kB pages. My Fedora arm64 machine, which doesn’t give NullReferenceExceptions, has 4kB pages.
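For anyone comparing machines, here is a minimal sketch for checking the page size from a script. It assumes a Linux host with Python 3 available; the helper name page_size_bytes is hypothetical. Running `getconf PAGESIZE` in a shell should report the same value.

```python
import os


def page_size_bytes() -> int:
    """Return the OS memory page size in bytes (Unix-like systems)."""
    return os.sysconf("SC_PAGE_SIZE")


if __name__ == "__main__":
    size = page_size_bytes()
    # A 4kB-page kernel reports 4096; a 64kB-page kernel reports 65536.
    print(f"page size: {size} bytes ({size // 1024} kB)")
```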

The failures with asserts (pMDReal != NULL) || !pCF->IsFrameless() are something we have not seen before. The issue with Assertion failed '!"Instruction cannot be encoded: IF_DI_2A"' in 'BigFrames.Test:Test1(int)' during 'Generate code' (IL size 23715) was recently hit during macOS arm64 bringup and it was caused by incorrect handling of OS page size in JIT in the stack probing code generation. See #42023. Does your RHEL 8 installation have page size different than 4kB?

@omajid, @tmds I have tried to run all coreclr pri 1 tests on RHEL 8 with 64kB page size using the latest main and no tests were failing with NullReferenceException anymore. I had to run the tests manually (enumerating all of the related .sh files and running them with an added -coreroot argument), since the Preview 6 SDK / runtime that’s normally used to execute xunit doesn’t have the fix for the GS cookie mapping issue that I fixed recently by switching to the lld linker. Out of all the coreclr pri 1 tests, 10052 succeeded, 29 failed and 3 timed out. Of the failures, 15 are “Unhandled exception. System.InvalidProgramException: Vararg calling convention not supported.”, a few were caused by the testing methodology (some tests can properly run only via xunit), and the remaining failures are of unknown kind (but no crashes, just error codes meaning the test didn’t pass as expected). So I am closing this issue.

@tmds I believe the issue doesn’t occur with 4kB memory pages. Only when the distro uses larger pages does the block with the cookie “leak” into code.
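To illustrate the page-size dependence, here is a rough sketch of the arithmetic. The addresses below are hypothetical and are not the real coreclr layout; they only show how a data block placed on a 4kB boundary right after code gets its own page with 4kB pages, but lands on the same page as the code with 64kB pages.

```python
def page_of(addr: int, page_size: int) -> int:
    """Index of the page that contains addr."""
    return addr // page_size


def shares_page(data_addr: int, code_start: int, code_end: int, page_size: int) -> bool:
    """True if data_addr lies on a page also covered by [code_start, code_end)."""
    first = page_of(code_start, page_size)
    last = page_of(code_end - 1, page_size)
    return first <= page_of(data_addr, page_size) <= last


# Hypothetical layout, aligned for 4kB pages (not taken from the runtime).
CODE_START = 0x10000
CODE_END = 0x13000      # exclusive end of the code range
COOKIE_ADDR = 0x14000   # data block placed on a later 4kB boundary

if __name__ == "__main__":
    for page_size in (4 * 1024, 64 * 1024):
        shared = shares_page(COOKIE_ADDR, CODE_START, CODE_END, page_size)
        print(f"{page_size // 1024} kB pages: cookie shares a page with code = {shared}")
```

With this layout, any protection change applied to the code range at page granularity would also cover the data block when pages are 64kB, which is one way to picture the “leak” described above.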

@janvorli I have a RHEL8 ARM64 machine that I’m setting up for CI that repros the NullReferenceException - would taking a look at that help you? Or I can try out any other prospective fixes.

Without the priority1 option, you’ve run just the priority 0 tests, which are just a fraction of all the 10000+ tests. As for the failures, these same tests keep failing on my local Ubuntu 16.04 repo too, so they do not indicate any RHEL 8 specific issue. The -100 exit code means timeout. So I’d recommend running the pri1 tests to get better coverage. Running in a Docker container is also a good way to determine whether the issue you were seeing is in the kernel or in the repo shared libraries, since a container shares the host’s kernel.

I ran a few experiments which suggest the issue is in the rhel8 kernel.