runtime: A self-built crossgen2 hangs on some recent Linux distributions
Description
I don’t fully understand the cause yet, but I wanted to file this bug to raise awareness and get tips on how to narrow down the issue.
I am using source-build on Fedora 35 to build all of .NET 6. In this environment, the source-built crossgen2 hangs when trying to build parts of ASP.NET Core.
ps shows that this command has been running since last evening, without any progress:
omajid 138650 0.0 0.2 3636216 81968 pts/1 Sl+ Nov15 0:00 dotnet6.0/dotnet-9e8b04-x64-bootstrap/src/aspnetcore.ae1a6cbe225b99c0bf38b7e31bf60cb653b73a52/artifacts/source-build/self/package-cache/microsoft.netcore.app.crossgen2.linux-x64/6.0.0/tools/crossgen2 --targetarch:x64 --targetos:linux -O @dotnet6.0/dotnet-9e8b04-x64-bootstrap/src/aspnetcore.ae1a6cbe225b99c0bf38b7e31bf60cb653b73a52/artifacts/source-build/self/src/artifacts/obj/Microsoft.AspNetCore.App.Runtime/Release/net6.0/linux-x64/crossgen/PlatformAssembliesPathsCrossgen2.rsp --perfmap --perfmap-format-version:1 --perfmap-path:dotnet6.0/dotnet-9e8b04-x64-bootstrap/src/aspnetcore.ae1a6cbe225b99c0bf38b7e31bf60cb653b73a52/artifacts/source-build/self/src/artifacts/bin/Microsoft.AspNetCore.App.Runtime/Release/net6.0/linux-x64/ -o:dotnet6.0/dotnet-9e8b04-x64-bootstrap/src/aspnetcore.ae1a6cbe225b99c0bf38b7e31bf60cb653b73a52/artifacts/source-build/self/src/artifacts/bin/Microsoft.AspNetCore.App.Runtime/Release/net6.0/linux-x64/Microsoft.Extensions.Caching.Abstractions.dll dotnet6.0/dotnet-9e8b04-x64-bootstrap/src/aspnetcore.ae1a6cbe225b99c0bf38b7e31bf60cb653b73a52/artifacts/source-build/self/package-cache/microsoft.extensions.caching.abstractions/6.0.0/lib/netstandard2.0/Microsoft.Extensions.Caching.Abstractions.dll
sha256sum lets me confirm that this is the bit-for-bit identical crossgen2 built from the runtime repo (and not one fetched from a NuGet package):
$ sha256sum $(find -iname crossgen2 -type f)
76119b25d2b83fe97f9e7cc16848d5cbe36962c077668b4dd14e5c5d3fd746ed ./dotnet-9e8b04bbff820c93c142f99a507a46b976f5c14c-x64-bootstrap/src/runtime.4822e3c3aa77eb82b2fb33c9321f923cf11ddde6/artifacts/source-build/self/src/artifacts/obj/Microsoft.NETCore.App.Crossgen2/Release/net6.0/fedora.35-x64/output/crossgen2
76119b25d2b83fe97f9e7cc16848d5cbe36962c077668b4dd14e5c5d3fd746ed ./dotnet-9e8b04bbff820c93c142f99a507a46b976f5c14c-x64-bootstrap/src/runtime.4822e3c3aa77eb82b2fb33c9321f923cf11ddde6/artifacts/source-build/self/src/artifacts/bin/coreclr/Linux.x64.Release/crossgen2/crossgen2
76119b25d2b83fe97f9e7cc16848d5cbe36962c077668b4dd14e5c5d3fd746ed ./dotnet-9e8b04bbff820c93c142f99a507a46b976f5c14c-x64-bootstrap/src/aspnetcore.ae1a6cbe225b99c0bf38b7e31bf60cb653b73a52/artifacts/source-build/self/package-cache/microsoft.netcore.app.crossgen2.linux-x64/6.0.0/tools/crossgen2
As a point of comparison, this works fine on Fedora 34.
Relevant versions of packages that I am using:
$ rpm -q clang llvm lldb lttng-ust gcc make
clang-13.0.0~rc1-1.fc35.x86_64
llvm-13.0.0~rc1-1.fc35.x86_64
lldb-13.0.0~rc1-1.fc35.x86_64
lttng-ust-2.12.2-5.fc35.x86_64
gcc-11.2.1-1.fc35.x86_64
make-4.3-6.fc35.x86_64
Also tested with the final (non-rc) clang packages, clang-13.0.0-3.fc35.x86_64 and llvm-13.0.0-4.fc35.x86_64, and the result is the same.
@alucryd suggested this might be a clang 13 issue: https://github.com/dotnet/source-build/issues/2602#issuecomment-968293263
This was also observed by another user on Fedora 35: https://pagure.io/dotnet-sig/dotnet6.0/issue/1
Reproduction Steps
On a Fedora 35 machine:
git clone https://pagure.io/dotnet-sig/dotnet6.0.git
cd dotnet6.0
git checkout abafa176a7ac41bc6b2ebf84040bd39bca21c15a
sudo dnf build-dep dotnet5.0 -y
./build-dotnet-tarball --bootstrap 9e8b04bbff820c93c142f99a507a46b976f5c14c
# Wait
fedpkg --release f35 local
# After an hour or so, build hangs
Edit: I updated these instructions to use commit abafa176a7ac41bc6b2ebf84040bd39bca21c15a, since later versions disable crossgen completely to try to work around this issue.
Expected behavior
Build (more specifically, crossgen2) runs to completion
Actual behavior
crossgen2 hangs
Regression?
This was working on Fedora 34. It’s a regression somewhere in the software stack, but I am not sure where the root cause lies.
Known Workarounds
None that I know of, at least on Fedora 35.
Configuration
$ uname -a
Linux terminus 5.14.17-301.fc35.x86_64 #1 SMP Mon Nov 8 13:57:43 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/os-release
NAME="Fedora Linux"
VERSION="35 (Workstation Edition)"
ID=fedora
VERSION_ID=35
VERSION_CODENAME=""
PLATFORM_ID="platform:f35"
PRETTY_NAME="Fedora Linux 35 (Workstation Edition)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:35"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f35/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=35
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=35
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="Workstation Edition"
VARIANT_ID=workstation
Other information
No response
About this issue
- State: closed
- Created 3 years ago
- Reactions: 7
- Comments: 58 (53 by maintainers)
With great help from @jakobbotsch, the NullReferenceException culprit was discovered. I’ll create a PR fixing those two issues in a minute.
This was the clang change that changed the behavior: https://github.com/llvm/llvm-project/commit/0aa0458f1429372038ca6a4edc7e94c96cd9a753
I am almost convinced now that the issue with NullReferenceException is also due to the clang 13 compilation. I was able to find a way to get a 100% repro in the CscBench coreclr test by disabling tiered jitting (COMPlus_TieredCompilation=0). The generated code for one of the methods is wrong, explicitly passing null as one of the arguments to a call. Since the runtime built using an older clang doesn’t have this issue (I’ve verified that), I suppose that the problem is likely that the runtime returns a wrong response to one of the JIT2EE interface calls.
I am looking into it and I have found a clang 13 codegen bug that causes e.g. the baseservices/exceptions/simple/fault test to fail with an unhandled exception. The problem is that the bug causes incorrect decoding of the number of EH clauses in the Main method in this test (it thinks there are zero clauses). The codegen issue happens here: https://github.com/dotnet/runtime/blob/4da6b9a8d55913c0ea560d63590d35dc942425be/src/coreclr/inc/corhlpr.h#L582
The Align call is ignored, so the returned pointer is not aligned; in this case, it points to an address that’s three bytes lower than the correct one (00007FFFF40D62CD instead of 00007FFFF40D62D0). I don’t know if that’s the only codegen issue in the whole runtime or if it is just the tip of the iceberg. However, when I manually aligned that address in the debugger, the test passed.
@janvorli Try this on a Fedora 35 machine/vm/container:
Edit: if that works, can you try using the just-built source-built to compile itself? It should look like this
Alternatively, try https://github.com/dotnet/source-build#building, but before step 4 (running ./build.sh) apply your changes to the files under ./src/runtime.4822e3c3aa77eb82b2fb33c9321f923cf11ddde6/ via patch or even just manually.
@omajid the comment is correct and the structure is aligned in the data. The problem is in the way we are computing the aligned position.
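For illustration, the rounding that the Align call is expected to perform can be checked against the two addresses quoted earlier in the thread. This is only a minimal sketch: AlignUp4 is a hypothetical helper written for this example, not the runtime's actual Align implementation, and it assumes a 4-byte alignment requirement.

```cpp
#include <cstdint>

// Hypothetical helper (not the runtime's Align): round an address up to a
// multiple of 4 by adding 3 and clearing the two low bits.
constexpr std::uint64_t AlignUp4(std::uint64_t addr) {
    return (addr + 3) & ~std::uint64_t{3};
}

// The misaligned address from the report rounds up to the expected one.
static_assert(AlignUp4(0x00007FFFF40D62CDull) == 0x00007FFFF40D62D0ull,
              "misaligned address is rounded up by three bytes");

// An address that is already aligned is left unchanged.
static_assert(AlignUp4(0x00007FFFF40D62D0ull) == 0x00007FFFF40D62D0ull,
              "aligned address is a fixed point");
```

If the compiler drops the mask step because it believes the input is already aligned, the first case returns the input unchanged, which matches the three-byte discrepancy described in the debugger session.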
Oops, I am sorry, I’ve missed the subtle difference.
Here is the repro:
https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename:‘1’,fontScale:14,fontUsePx:‘0’,j:1,lang:c%2B%2B,selection:(endColumn:1,endLineNumber:20,positionColumn:1,positionLineNumber:20,selectionStartColumn:1,selectionStartLineNumber:20,startColumn:1,startLineNumber:20),source:'%23include+<stdint.h> class+My { ++++int+value%3B ++++void*+Test1()%3B ++++void*+Test2()%3B }%3B void*+My::Test1() { ++++return+(void*)(((uintptr_t)this)+%2B+3)%3B } void*+My::Test2() { ++++return+(void*)((((uintptr_t)this)+%2B+3)+%26+~3)%3B } '),l:‘5’,n:‘0’,o:‘C%2B%2B+source+%231’,t:‘0’)),k:42.54317111459969,l:‘4’,n:‘0’,o:‘’,s:0,t:‘0’),(g:!((h:compiler,i:(compiler:clang1300,filters:(b:‘0’,binary:‘1’,commentOnly:‘0’,demangle:‘0’,directives:‘0’,execute:‘1’,intel:‘0’,libraryCode:‘1’,trim:‘1’),flagsViewOpen:‘1’,fontScale:14,fontUsePx:‘0’,j:2,lang:c%2B%2B,libs:!(),options:‘-O3’,selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1,tree:‘1’),l:‘5’,n:‘0’,o:‘x86-64+clang+13.0.0+(C%2B%2B,+Editor+%231,+Compiler+%232)’,t:‘0’)),header:(),k:24.12349555206698,l:‘4’,m:48.128235762644366,n:‘0’,o:‘’,s:0,t:‘0’),(g:!((h:output,i:(compiler:2,editor:1,fontScale:14,fontUsePx:‘0’,tree:‘1’,wrap:‘1’),l:‘5’,n:‘0’,o:‘Output+of+x86-64+clang+13.0.0+(Compiler+%232)’,t:‘0’)),k:33.33333333333333,l:‘4’,n:‘0’,o:‘’,s:0,t:‘0’)),l:‘2’,n:‘0’,o:‘’,t:‘0’)),version:4
It looks like clang optimizations started assuming that the C++ this pointer is properly aligned. It breaks the code that overlays C++ classes over misaligned memory.
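For readers who cannot open the Compiler Explorer link, the source encoded in that URL decodes to roughly the following. Two things here are additions for this sketch and are not in the link: class is changed to struct so the methods can be called from outside, and AlignBytePtrUp4 is a hypothetical helper showing one possible way to avoid the assumption. It is not necessarily how the runtime was fixed.

```cpp
#include <stdint.h>

// Decoded from the Compiler Explorer link (class changed to struct here).
struct My {
    int value;
    void* Test1();
    void* Test2();
};

// Plain offset from this: there is no mask for the optimizer to remove.
void* My::Test1() {
    return (void*)(((uintptr_t)this) + 3);
}

// Round this up to a multiple of 4. A compiler that assumes this is already
// aligned to alignof(My) may treat the add-and-mask as a no-op and return
// this unchanged, which is wrong when the object overlays misaligned memory.
void* My::Test2() {
    return (void*)((((uintptr_t)this) + 3) & ~3);
}

// Hypothetical alternative: do the rounding on a raw byte pointer before any
// My* is formed. A byte pointer carries no alignment assumption beyond 1.
inline const unsigned char* AlignBytePtrUp4(const unsigned char* p) {
    uintptr_t addr = (uintptr_t)p;
    return (const unsigned char*)((addr + 3) & ~(uintptr_t)3);
}
```

The idea behind the helper is that the rounding must happen while the address is still a byte pointer. Casting this to a byte pointer inside a member function would not help, since the compiler may already have applied its alignment assumption to this.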