runtime: Fatal error. Internal CLR error. (0x80131506)

Description

 at Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.PointDouble(PointAccum)
   at Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ScalarMultStrausVar(UInt32[], UInt32[], PointAffine, PointAccum)
   at Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ImplVerify(Byte[], Int32, Byte[], Int32, Byte[], Byte, Byte[], Int32, Int32)
   at zero.cocoon.models.CcDiscoveries+<ConsumeAsync>d__16.MoveNext()

This started happening regularly and somewhat reproducibly with .NET 7 Preview 5, a couple of versions ago. My next-level scheduler and async/await optimizations were also taking effect around the same time, so I cannot be sure; it could simply be caused by the increased load those changes achieve rather than by something in the runtime.

zero.cocoon.models.CcDiscoveries+&lt;ConsumeAsync&gt;d__16.MoveNext() is the part that reads and parses messages delivered by the UDP backend with protobuf and then tries to verify their signatures. The signature verification is where it fails. This function is called at a very high rate, so eventually the error happens: the error itself has a very low probability per call, but the cluster load makes it likely overall.
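
For reference, the verification step is the standard BouncyCastle Ed25519 flow. Below is a minimal sketch (assumed buffer names, not the actual CcDiscoveries code) of the call that bottoms out in ImplVerify, ScalarMultStrausVar and PointDouble:

```csharp
// Minimal sketch, not the actual CcDiscoveries code: publicKey, message and
// signature are assumed to have been extracted from the protobuf payload.
using Org.BouncyCastle.Crypto.Parameters;
using Org.BouncyCastle.Crypto.Signers;

static bool VerifySketch(byte[] publicKey, byte[] message, byte[] signature)
{
    var key = new Ed25519PublicKeyParameters(publicKey, 0);
    var verifier = new Ed25519Signer();
    verifier.Init(false, key);                          // false = verify mode
    verifier.BlockUpdate(message, 0, message.Length);   // feed the parsed payload
    return verifier.VerifySignature(signature);         // ends up in ImplVerify -> ScalarMultStrausVar -> PointDouble
}
```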

Reproduction Steps

Clone & compile https://github.com/tactical-drone/zero and run it.

Cluster bootstrap tests regularly fail with this error (roughly a 1-in-5 chance).

Expected behavior

Profit.

Actual behavior

Roughly every second or third run I get this error in RELEASE mode. I have never seen it happen in DEBUG.

What is awesome is that it is very reproducible. Borg technology is designed to push the runtime to its limits.

Regression?

This worked in .NET 7 Preview 4.

Known Workarounds

None

Configuration

CPU: AMD 3900X
RAM: 64 GB
Runtime: 7.0.0-preview.5.22266.11

Other information

I have no idea. The cluster test needs 10 GB of memory in RELEASE mode and 20 GB in DEBUG.

PointDouble does a ton of small byte-array allocations and performs basic math on them, littered with bit shifts.
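
Illustrative only (this is not the BouncyCastle source), a sketch of the kind of allocation-plus-shift pattern such a hot verify path exercises on every call:

```csharp
// Illustrative sketch, not BouncyCastle code: many short-lived small-array
// allocations combined with shift-heavy arithmetic, exercised per message.
static uint[] DoubleLimbs(uint[] x)
{
    var t = new uint[x.Length];        // small, short-lived allocation
    ulong carry = 0;
    for (int i = 0; i < x.Length; i++)
    {
        carry += (ulong)x[i] << 1;     // bit-shift-heavy math
        t[i] = (uint)carry;
        carry >>= 32;
    }
    return t;
}
```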

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 48 (32 by maintainers)

Most upvoted comments

@jakobbotsch Good point. The reason I mention these CAS issues is that while I was redesigning my semaphore over and over, those parts seemed to have the biggest effect on the program crashing in these strange ways. The CAS instructions lead to ValueTasks that break, throwing the exceptions above at high rates and causing the original issue. But in a way you are correct: these are separate issues.

Naturally I don’t think these bugs will occur if you use .NET like a civilized individual. These errors do not occur when I disable my scheduler, and my scheduler depends on two CAS instructions to work. The rest of the concurrency issues are solved by using runtime tech instead of my own contraptions, unless the glue provided in ZeroCore is fundamentally wrong. I spent a lot of time coming up with that design (using runtime tech instead of another contraption), so I would be surprised, but sad, if there was a fundamental issue in the way I multiplex ValueTaskSourceCores.
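
To make the moving parts concrete, here is a minimal sketch (hypothetical type and member names, not the ZeroCore implementation) of the two pieces involved: a CAS gate via Interlocked.CompareExchange, and a reusable IValueTaskSource built on the runtime's ManualResetValueTaskSourceCore:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

// Hypothetical type, for illustration only; not the ZeroCore implementation.
sealed class ReusableSignal : IValueTaskSource<bool>
{
    private ManualResetValueTaskSourceCore<bool> _core;   // runtime-provided plumbing
    private int _state;                                    // 0 = idle, 1 = armed

    public ValueTask<bool> WaitAsync()
    {
        // CAS #1: only one waiter may arm the signal at a time.
        if (Interlocked.CompareExchange(ref _state, 1, 0) != 0)
            throw new InvalidOperationException("already armed");
        return new ValueTask<bool>(this, _core.Version);
    }

    public void Set(bool result)
    {
        // CAS #2: release the gate before completing the waiter.
        if (Interlocked.CompareExchange(ref _state, 0, 1) == 1)
            _core.SetResult(result);
    }

    bool IValueTaskSource<bool>.GetResult(short token)
    {
        bool result = _core.GetResult(token);
        _core.Reset();                                     // make the source reusable
        return result;
    }

    ValueTaskSourceStatus IValueTaskSource<bool>.GetStatus(short token)
        => _core.GetStatus(token);

    void IValueTaskSource<bool>.OnCompleted(Action<object?> continuation, object? state,
        short token, ValueTaskSourceOnCompletedFlags flags)
        => _core.OnCompleted(continuation, state, token, flags);
}
```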

To be clear, I don’t want to send you guys on a wild goose chase when I mess with custom schedulers. Obviously any crash reported is a good thing. The least I can do is tell you how to disable my custom bits so that you can better discern whether you are chasing my bugs or the runtime's.