runtime: A self-built crossgen2 hangs on some recent Linux distributions

Description

I don’t fully understand everything, but wanted to file this bug to raise awareness and get tips on how to narrow down the issue.

I am using source-build on Fedora 35 to build all of .NET 6. In this environment, the source-built crossgen2 hangs when trying to build parts of ASP.NET Core.

ps shows that this command has been running since last evening, without any progress:

omajid 138650 0.0 0.2 3636216 81968 pts/1 Sl+ Nov15 0:00 dotnet6.0/dotnet-9e8b04-x64-bootstrap/src/aspnetcore.ae1a6cbe225b99c0bf38b7e31bf60cb653b73a52/artifacts/source-build/self/package-cache/microsoft.netcore.app.crossgen2.linux-x64/6.0.0/tools/crossgen2 --targetarch:x64 --targetos:linux -O @dotnet6.0/dotnet-9e8b04-x64-bootstrap/src/aspnetcore.ae1a6cbe225b99c0bf38b7e31bf60cb653b73a52/artifacts/source-build/self/src/artifacts/obj/Microsoft.AspNetCore.App.Runtime/Release/net6.0/linux-x64/crossgen/PlatformAssembliesPathsCrossgen2.rsp --perfmap --perfmap-format-version:1 --perfmap-path:dotnet6.0/dotnet-9e8b04-x64-bootstrap/src/aspnetcore.ae1a6cbe225b99c0bf38b7e31bf60cb653b73a52/artifacts/source-build/self/src/artifacts/bin/Microsoft.AspNetCore.App.Runtime/Release/net6.0/linux-x64/ -o:dotnet6.0/dotnet-9e8b04-x64-bootstrap/src/aspnetcore.ae1a6cbe225b99c0bf38b7e31bf60cb653b73a52/artifacts/source-build/self/src/artifacts/bin/Microsoft.AspNetCore.App.Runtime/Release/net6.0/linux-x64/Microsoft.Extensions.Caching.Abstractions.dll dotnet6.0/dotnet-9e8b04-x64-bootstrap/src/aspnetcore.ae1a6cbe225b99c0bf38b7e31bf60cb653b73a52/artifacts/source-build/self/package-cache/microsoft.extensions.caching.abstractions/6.0.0/lib/netstandard2.0/Microsoft.Extensions.Caching.Abstractions.dll

A sha256sum lets me confirm that this crossgen2 is bit-for-bit identical to the one built from the runtime repo (and not one fetched from a NuGet package):

$ sha256sum $(find -iname crossgen2 -type f)
76119b25d2b83fe97f9e7cc16848d5cbe36962c077668b4dd14e5c5d3fd746ed  ./dotnet-9e8b04bbff820c93c142f99a507a46b976f5c14c-x64-bootstrap/src/runtime.4822e3c3aa77eb82b2fb33c9321f923cf11ddde6/artifacts/source-build/self/src/artifacts/obj/Microsoft.NETCore.App.Crossgen2/Release/net6.0/fedora.35-x64/output/crossgen2
76119b25d2b83fe97f9e7cc16848d5cbe36962c077668b4dd14e5c5d3fd746ed  ./dotnet-9e8b04bbff820c93c142f99a507a46b976f5c14c-x64-bootstrap/src/runtime.4822e3c3aa77eb82b2fb33c9321f923cf11ddde6/artifacts/source-build/self/src/artifacts/bin/coreclr/Linux.x64.Release/crossgen2/crossgen2
76119b25d2b83fe97f9e7cc16848d5cbe36962c077668b4dd14e5c5d3fd746ed  ./dotnet-9e8b04bbff820c93c142f99a507a46b976f5c14c-x64-bootstrap/src/aspnetcore.ae1a6cbe225b99c0bf38b7e31bf60cb653b73a52/artifacts/source-build/self/package-cache/microsoft.netcore.app.crossgen2.linux-x64/6.0.0/tools/crossgen2

As a point of comparison, this works fine on Fedora 34.

Relevant versions of packages that I am using:

$ rpm -q clang llvm lldb lttng-ust gcc make 
clang-13.0.0~rc1-1.fc35.x86_64
llvm-13.0.0~rc1-1.fc35.x86_64
lldb-13.0.0~rc1-1.fc35.x86_64
lttng-ust-2.12.2-5.fc35.x86_64
gcc-11.2.1-1.fc35.x86_64
make-4.3-6.fc35.x86_64

Also tested with the final (non-rc) clang packages: clang-13.0.0-3.fc35.x86_64 llvm-13.0.0-4.fc35.x86_64, and the result is the same.

@alucryd suggested this might be a clang 13 issue: https://github.com/dotnet/source-build/issues/2602#issuecomment-968293263

This was also observed by another user on Fedora 35: https://pagure.io/dotnet-sig/dotnet6.0/issue/1

Reproduction Steps

On a Fedora 35 machine:

git clone https://pagure.io/dotnet-sig/dotnet6.0.git
cd dotnet6.0
git checkout abafa176a7ac41bc6b2ebf84040bd39bca21c15a
sudo dnf build-dep dotnet5.0 -y
./build-dotnet-tarball --bootstrap 9e8b04bbff820c93c142f99a507a46b976f5c14c
# Wait
fedpkg --release f35 local
# After an hour or so, build hangs

Edit: I updated these instructions to use commit abafa176a7ac41bc6b2ebf84040bd39bca21c15a since later versions will disable crossgen completely to try and work around this issue.

Expected behavior

The build (more specifically, crossgen2) runs to completion.

Actual behavior

crossgen2 hangs

Regression?

This was working on Fedora 34. It’s a regression somewhere in the software stack, but I am not sure where the root cause lies.

Known Workarounds

None that I know of, at least on Fedora 35.

Configuration

$ uname -a
Linux terminus 5.14.17-301.fc35.x86_64 #1 SMP Mon Nov 8 13:57:43 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/os-release 
NAME="Fedora Linux"
VERSION="35 (Workstation Edition)"
ID=fedora
VERSION_ID=35
VERSION_CODENAME=""
PLATFORM_ID="platform:f35"
PRETTY_NAME="Fedora Linux 35 (Workstation Edition)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:35"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f35/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=35
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=35
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="Workstation Edition"
VARIANT_ID=workstation

Other information

No response

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 7
  • Comments: 58 (53 by maintainers)

Most upvoted comments

With great help from @jakobbotsch, the NullReferenceException culprit was discovered. I’ll create a PR fixing those two issues in a minute.

I am almost convinced now that the NullReferenceException issue is also due to the clang 13 compilation. I was able to get a 100% repro in the CscBench coreclr test by disabling tiered jitting (COMPlus_TieredCompilation=0). The generated code for one of the methods is wrong: it explicitly passes null as one of the arguments to a call. Since the runtime built using an older clang doesn’t have this issue (I’ve verified that), I suppose the likely problem is that the runtime returns a wrong response to one of the JIT2EE interface calls.

I am looking into it and have found a clang 13 codegen bug that causes, for example, the baseservices/exceptions/simple/fault test to fail with an unhandled exception. The bug causes the number of EH clauses in this test’s Main method to be decoded incorrectly (it thinks there are zero clauses). The codegen issue happens here: https://github.com/dotnet/runtime/blob/4da6b9a8d55913c0ea560d63590d35dc942425be/src/coreclr/inc/corhlpr.h#L582 The Align call is ignored, so the returned pointer is not aligned; in this case it points to an address three bytes lower than the correct one (00007FFFF40D62CD instead of 00007FFFF40D62D0). I don’t know whether that’s the only codegen issue in the whole runtime or just the tip of the iceberg. However, when I manually aligned that address in the debugger, the test passed.
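
To make the arithmetic concrete, here is a minimal sketch of rounding an address up to a 4-byte boundary, checked against the two addresses quoted above. The 4-byte alignment is an assumption taken from the three-byte difference between those addresses, and `align_up_4` and `check_quoted_addresses` are hypothetical names used only for illustration; this is not the runtime's `Align` implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not the runtime's code): round an address up to the
// next multiple of 4 using plain integer arithmetic.
constexpr std::uint64_t align_up_4(std::uint64_t addr) {
    return (addr + 3) & ~std::uint64_t{3};
}

// Checks the helper against the two addresses quoted in the comment above.
void check_quoted_addresses() {
    const std::uint64_t unaligned = 0x00007FFFF40D62CDull;  // pointer that was returned
    const std::uint64_t expected  = 0x00007FFFF40D62D0ull;  // pointer that should have been returned
    assert(align_up_4(unaligned) == expected);
    assert(align_up_4(expected) == expected);  // an aligned address is left unchanged
}
```

If the rounding step is dropped, the unaligned address is returned as-is, which matches the three-byte discrepancy described above.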

@janvorli Try this on a Fedora 35 machine/vm/container:

git clone https://pagure.io/dotnet-sig/dotnet6.0.git
cd dotnet6.0
git checkout clang13-hack
# replace runtime-pragma-pack.patch with the actual fix (generated via `git diff` or `git format-patch`, etc) 
sudo dnf build-dep dotnet5.0 -y
./build-dotnet-tarball --bootstrap 9e8b04bbff820c93c142f99a507a46b976f5c14c
# Wait about 10 minutes
fedpkg --release f35 local
# Wait about an hour

Edit: if that works, can you try using the just-built source-built .NET to build itself? It should look like this:

pushd x86_64/
sudo dnf install $(ls | grep -v debug) -y
popd
sed -i -E 's|^%bcond_without bootstrap|%bcond_with bootstrap|' dotnet6.0.spec
sed -i -r 's|(Release: *)([0-9]+)|echo "\1$((\2+1))"|ge' dotnet6.0.spec
./build-dotnet-tarball 9e8b04bbff820c93c142f99a507a46b976f5c14c
# wait 10 minutes
fedpkg --release f35 local
# wait about an hour

Alternatively, try https://github.com/dotnet/source-build#building, but before step 4 (running ./build.sh) apply your changes to the files under ./src/runtime.4822e3c3aa77eb82b2fb33c9321f923cf11ddde6/ via patch or even just manually.

@omajid the comment is correct and the structure is aligned in the data. The problem is in the way we are computing the aligned position.

Oops, I am sorry, I’ve missed the subtle difference.

Here is the repro:

https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename:‘1’,fontScale:14,fontUsePx:‘0’,j:1,lang:c%2B%2B,selection:(endColumn:1,endLineNumber:20,positionColumn:1,positionLineNumber:20,selectionStartColumn:1,selectionStartLineNumber:20,startColumn:1,startLineNumber:20),source:'%23include+<stdint.h> class+My { ++++int+value%3B ++++void*+Test1()%3B ++++void*+Test2()%3B }%3B void*+My::Test1() { ++++return+(void*)(((uintptr_t)this)+%2B+3)%3B } void*+My::Test2() { ++++return+(void*)((((uintptr_t)this)+%2B+3)+%26+~3)%3B } '),l:‘5’,n:‘0’,o:‘C%2B%2B+source+%231’,t:‘0’)),k:42.54317111459969,l:‘4’,n:‘0’,o:‘’,s:0,t:‘0’),(g:!((h:compiler,i:(compiler:clang1300,filters:(b:‘0’,binary:‘1’,commentOnly:‘0’,demangle:‘0’,directives:‘0’,execute:‘1’,intel:‘0’,libraryCode:‘1’,trim:‘1’),flagsViewOpen:‘1’,fontScale:14,fontUsePx:‘0’,j:2,lang:c%2B%2B,libs:!(),options:‘-O3’,selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1,tree:‘1’),l:‘5’,n:‘0’,o:‘x86-64+clang+13.0.0+(C%2B%2B,+Editor+%231,+Compiler+%232)’,t:‘0’)),header:(),k:24.12349555206698,l:‘4’,m:48.128235762644366,n:‘0’,o:‘’,s:0,t:‘0’),(g:!((h:output,i:(compiler:2,editor:1,fontScale:14,fontUsePx:‘0’,tree:‘1’,wrap:‘1’),l:‘5’,n:‘0’,o:‘Output+of+x86-64+clang+13.0.0+(Compiler+%232)’,t:‘0’)),k:33.33333333333333,l:‘4’,n:‘0’,o:‘’,s:0,t:‘0’)),l:‘2’,n:‘0’,o:‘’,t:‘0’)),version:4

It looks like clang optimizations started assuming that the C++ this pointer is properly aligned. This breaks code that overlays C++ classes over misaligned memory.
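
For readers who cannot open the Compiler Explorer link, the source embedded in its URL decodes to roughly the following (the link selects x86-64 clang 13.0.0 with -O3). Whitespace is restored by hand; the `public:` specifier, the `<cassert>` include, and the `demo_on_aligned_object` function are additions so the methods can be called, and are not part of the original repro.

```cpp
#include <cassert>
#include <stdint.h>

class My {
public:  // added so the methods can be called from outside the class
    int value;
    void* Test1();
    void* Test2();
};

// Adds 3 to the address of the object without rounding.
void* My::Test1() {
    return (void*)(((uintptr_t)this) + 3);
}

// Rounds the address of the object up to a multiple of 4.
void* My::Test2() {
    return (void*)((((uintptr_t)this) + 3) & ~3);
}

// Added for illustration: on a correctly aligned object both results are
// well defined, however the compiler chooses to evaluate Test2.
void demo_on_aligned_object() {
    My m{};
    assert(m.Test1() == (void*)((uintptr_t)&m + 3));
    assert(m.Test2() == (void*)&m);  // an aligned address rounds to itself
}
```

If the compiler assumes `this` is already a multiple of `alignof(My)`, it may reduce `Test2` to returning `this` unchanged. That is harmless for the aligned object in the demo, but it gives the wrong answer when the class is overlaid on memory that is not actually aligned, which is the situation described above.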