runtime: Environment.ProcessorCount incorrect reporting in containers in 3.1

We are trying to upgrade to .NET Core 3.1, but we noticed that Environment.ProcessorCount reports different values on .NET Core 3.1.

We use the official SDK Docker images. I’ve attached 2 sample projects, one for 3.0 and one for 3.1.

Repro

To repro, build and run with the following commands:

1. docker build . -f Dockerfile 
2. docker run --cpus=1 <image_built_from_#1>

Outcome

3.0

The number of processors on this computer is 1.

3.1

The number of processors on this computer is <actual_node_cpus>.

So if a machine has 8 cores and the container is assigned 1 core, in 3.0 we still got an outcome of 1, while in 3.1 the outcome is 8.

Is this change by design?

We are also seeing much higher CPU consumption in the 3.1 containers; our initial theory is that the runtime thinks it has more cores than it actually does.

repro.zip

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 78 (57 by maintainers)

Most upvoted comments

There are three questions I see:

  • What is the correct behavior?
  • Is this change breaking and should it be reverted?
  • Is there a workaround and what is it?

I’ll try to answer them, in order:

  • I would say that CPU quota is a very nebulous concept. It is a synthetic concept that is about capacity, not about actual cores. I think the new behavior is more correct, because it is oriented toward actual cores. If you set --cpus=7.0 on a 64-core machine, you will be executing your app on >7 cores. --cpuset-cpus is the concept that very clearly limits you to a particular set of cores. The runtime does the right thing with that setting.
  • This is obviously a breaking change. I think if we don’t take this change now, we never will. I think we should accept the break because it is the best behavior and we definitely want .NET Core to be a container-native runtime. My take is that this is the behavior a container-native runtime should have. I read some posts and this only convinced me that there is no one behavior that works, so we should just go with reporting facts and then build more policy on top (not at the bottom).
  • Yes, there are workarounds. I think there are a few (in order of preference, IMO): align memory limits and CPU quota to be coherent, opt in to workstation GC, set GC heaps to 0 with the flag referenced earlier.
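For reference, the pre-3.1 mapping from CFS quota to ProcessorCount can be sketched in shell. The exact rounding here is my assumption, inferred from the scenarios reported later in this thread (a 150000µs quota over a 100000µs period yields ProcessorCount 2):

```shell
# Map a CFS quota to a whole processor count: ceil(quota/period), floor of 1.
# Values correspond to `docker run --cpus=1.5` with the default 100ms period.
quota_us=150000
period_us=100000
effective_cpus=$(( (quota_us + period_us - 1) / period_us ))   # integer ceil
if [ "$effective_cpus" -lt 1 ]; then effective_cpus=1; fi
echo "$effective_cpus"   # prints 2
```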

Note: These are my answers. That is not to suggest they are not up for debate.

It would be great if folks running into this issue can try setting memory limits and see if that helps. That would be great input.

Related, we have been planning a container density investigation. We were hoping to do it in December, but it is likely to happen in January. I wish we had completed it already, so we’d have more data and experience with the scenario. I’m trying my best to “see around the corner” in the absence of having done that exercise. In particular, I’m trying to guess what I think the product default should be after we complete that exercise.

I really like where we landed for the memory limits default. It is super cut and dry and defendable. I’m having a lot of trouble defining a default for CPU quotas. @jkotas asked me what I thought the behavior should be at --cpus=7.0. I said that the behavior we have today is clearly correct. I immediately felt that we should have a different behavior for <=1.0 CPUs, that we should default to workstation GC, for example. In fact, that is what @Maoni0 and I discussed when we worked on the memory limits proposal (relative to how many GC heaps are created). After a bit more thought, a special behavior for <=1.0 CPUs gets really ugly as soon as you pop over to >1.0 CPUs. It’s not smooth at all. That’s bad. The nice thing about our memory limits behavior is that it scales smoothly.

I also thought about our upcoming density investigation and what we’ll value. I suspect we’ll want something more like this chart. We’re planning on running n instances of TechEmpower on a big (sharded) machine. We want to use as MUCH cpu as possible. That’s what I think our default should align with.

The behavior being requested feels more like an “acquiesce” sort of behavior. That’s totally rational and makes sense. It’s just not what I think we should align with as a default behavior. We need to decide what the best way to configure an app for that behavior should be, and how that aligns with K8s configuration options.

Fair?

I just updated an existing sample to include cgroup info @ https://github.com/richlander/testapps/blob/master/versioninfo/Program.cs#L28-L33

Here is what it does being compiled and run on 3.0 and 3.1. I elided irrelevant info.

Scenario 1: 3.0 SDK with 1 CPU set

```
C:\git\testapps\versioninfo>docker run --rm --cpus=1 -m 60mb -v %cd%:/app -w /app mcr.microsoft.com/dotnet/core/sdk:3.0 dotnet run
**.NET Core info**
Version: 3.0.1

**Environment info**
ProcessorCount: 1

**CGroup info**
cfs_quota_us: 100000
memory.limit_in_bytes: 62914560
memory.usage_in_bytes: 62877696
```

Scenario 2: 3.0 SDK with 1.5 CPU set

```
C:\git\testapps\versioninfo>docker run --rm --cpus=1.5 -m 60mb -v %cd%:/app -w /app mcr.microsoft.com/dotnet/core/sdk:3.0 dotnet run
**.NET Core info**
Version: 3.0.1

**Environment info**
ProcessorCount: 2

**CGroup info**
cfs_quota_us: 150000
memory.limit_in_bytes: 62914560
memory.usage_in_bytes: 62812160
```

Scenario 3: 3.1 SDK with 1 CPU set

```
C:\git\testapps\versioninfo>docker run --rm --cpus=1 -m 60mb -v %cd%:/app -w /app mcr.microsoft.com/dotnet/core/sdk:3.1 dotnet run
**.NET Core info**
Version: 3.1.0

**Environment info**
ProcessorCount: 2

**CGroup info**
cfs_quota_us: 100000
memory.limit_in_bytes: 62914560
memory.usage_in_bytes: 62873600
```

Scenario 4: 3.1 SDK with CPU affinity set (to 1 core)

```
C:\git\testapps\versioninfo>docker run --rm --cpuset-cpus=0 -m 60mb -v %cd%:/app -w /app mcr.microsoft.com/dotnet/core/sdk:3.1 dotnet run
**.NET Core info**
Version: 3.1.0

**Environment info**
ProcessorCount: 1

**CGroup info**
cfs_quota_us: -1
memory.limit_in_bytes: 62914560
memory.usage_in_bytes: 62844928
```

CPU quota is the standard parameter for configuring Kubernetes containers. Before this change the quota got reflected in ProcessorCount (maybe semantically not correct), which means applications were taking it into account. With the change, the CPU quota is no longer available to, and no longer used by, .NET Core and .NET Core apps. The common case of many containers with low CPU quotas on a many-core machine puts your app in a weird configuration (e.g. a 0.7 quota with ProcessorCount 122).

We may need to introduce new configuration settings or new APIs to address this properly.

+1 The cpu quota should be available in some form.
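Until the quota is exposed through an API, an app or its entrypoint script can read it straight out of the cgroup filesystem. A minimal sketch, assuming a cgroup v1 mount (cgroup v2 hosts expose `cpu.max` instead; the `CGROUP_CPU_DIR` override is only for illustration/testing):

```shell
# Read the CFS quota and period from cgroup v1; a quota of -1 means "no limit".
cgroup_cpu_dir="${CGROUP_CPU_DIR:-/sys/fs/cgroup/cpu}"
if [ -r "$cgroup_cpu_dir/cpu.cfs_quota_us" ]; then
    quota_us=$(cat "$cgroup_cpu_dir/cpu.cfs_quota_us")
    period_us=$(cat "$cgroup_cpu_dir/cpu.cfs_period_us")
else
    # Files absent (e.g. cgroup v2 host): treat as unlimited.
    quota_us=-1
    period_us=100000
fi
echo "quota_us=$quota_us period_us=$period_us"
```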

Just adding my two cents here - got hit hard by this today on updating to netcore3.1 on a 112-core server - where every pod running on this server had CPU limits set on Kubernetes, but now are behaving like they’ve access to all the 112 cores.

My impression is that decision to change how many cores the runtime considers seems to have been rushed out based on a single-instance benchmark and not on real-world usage of mixed pods in the same server.

Is there any documentation on how to properly set up CPU limits for netcore3.1 under Kubernetes with this change?

There’s an important difference here between CPU shares and limit. This is the same as request and limit in Kubernetes or --cpus and --cpu-limit in Docker.

--cpus/k8s request means “this is what I think I need” vs --cpu-limit/k8s limit this is where I should be capped/throttled.

I don’t think dotnet should set processor count based on --cpus/k8s request. It is understood that a process can exceed this limit (at the system’s potential peril) in order to drive up utilization.

When it comes to --cpu-limit/k8s limit it’s a little more complicated. That value is the maximum number of CPU-microseconds that the scheduler will allow the process, but of course that doesn’t say anything about the maximum parallelism that is a good/performant idea.

My feeling is that in any high performance system, the degree of parallelism that works “well” is going to be highly dependent on the workload/code. But of course you probably want some heuristic value too.

So my recommendation would be:

  • Have some default max_threads value in dotnet that libraries can use as the heuristic “best guess” value.
  • Set this max_threads based on some heuristic blend of --cpus and --cpu-limit.
  • Enable users to override this value via environment variable or flag if they want to see something different.
  • Encourage library developers (e.g. HTTP Client) to also make their particular “HTTP_MAX_THREADS” or whatever configurable by the end user.

Basically, mixing # cores/cpu limit and max parallelism together is kind of a dangerous idea. It’s ok for the 80% use-case but I think it will fall down at high scale.
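The proposed heuristic blend of --cpus and --cpu-limit could look something like this sketch (max_threads, the millicore inputs, and the specific formula are all hypothetical illustrations of the proposal, not actual .NET settings):

```shell
# Hypothetical heuristic: derive a default parallelism value from the k8s
# request and limit (both in millicores), clamped to the physical core count.
request_m=700        # k8s request: "what I think I need"
limit_m=2000         # k8s limit: where the scheduler throttles the process
physical_cores=112
max_threads=$(( (limit_m + 999) / 1000 ))   # limit rounded up to whole cores
floor=$(( (request_m + 999) / 1000 ))       # never below the request
if [ "$max_threads" -lt "$floor" ]; then max_threads=$floor; fi
if [ "$max_threads" -gt "$physical_cores" ]; then max_threads=$physical_cores; fi
echo "$max_threads"   # prints 2 for these inputs
```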

> 3.0 is taking into account cpu quota instead of using the number of cores of the physical machine. This makes .NET Core container aware.

That does not make it container aware. From my point of view it just lies to itself for its own benefit. There is a significant difference between the amount of CPU time a process is allowed to consume and the number of threads it is allowed to execute in parallel. .NET Core <3.1 ties those two things together. Moreover, even with Environment.ProcessorCount == 1, multiple threads can be executing in parallel (as the OS would still schedule them onto multiple CPUs). Environment.ProcessorCount == 1 also forces a GC mode that is not overridable by any means (correct me if I am wrong), but I need to be able to use Server/Background GC even when the CPU quota is <1500m (Environment.ProcessorCount == 1).

As I tried to explain in my original post (which more or less started all this), we are running a bunch of services horizontally scaled (3-5 instances) on K8s for the sake of resiliency/availability, and none of them consume more than 1000m (put imprecisely, not even one whole CPU), but they are highly parallel, bursty, I/O applications. We are forced by netcore 2.2/3.0 to run them with the CPU quota set to 2000m just to keep Environment.ProcessorCount > 1. With Environment.ProcessorCount == 1 they grind to a halt under load (even though in some cases they would consume only 300m). Without a CPU quota, and thus Environment.ProcessorCount == 64 (in our case), they would consume ridiculous amounts of memory (due to the heaps created), and I assume that is the same behavior we would get with netcore 3.1. I guess we might be able to control this with the knobs currently available, but I don’t want to be forced to configure and maintain all of this per application. I want CoreCLR to be smart about it, possibly with some hints from the developer.

Just an update. Thanks for all this data/reports. It is super helpful. We conducted a quick investigation and are seeing similar results. We are starting a more formal investigation now.

Please feel free to share more data. We will look at it. However, our focus will now turn to more deep performance analysis. I hope to have more info to share soon (although the upcoming holidays may slow us down).

At this time, and with the information I have available, I would say that 3.0 is a better choice in CPU-limited containers. Please test 3.1 to ensure it is meeting your performance goals if you deploy it in production with CPU-limits.

My view is that the current implementation returns the truth. If you are running on a device with 112 CPUs, then even with the quota, you can still be running on 112 different cores over time. So the environment really has 112 different CPUs, thus the Environment.ProcessorCount reflects that.
But for some cases, it seems it is also important for applications to be able to query the current quota. So I believe we should expose it on the Environment. It also seems we should think again about the number of GC heaps and threadpool parameters with relation to the CPU count and quota. I don’t have a clear picture on how these should be related yet. Maybe the number of GC heaps should take the quota into account, but I am not sure.

I think reverting this change should be considered because of:

  • backwards compatibility: the value returned can be of a completely different magnitude
  • usability: cpu quota is the knob Kubernetes gives you, and this is no longer available

The change was made not because there was a functional issue, but to improve performance. The benchmarking performed is insufficient to validate performance does not regress when ProcessorCount returns much higher values than before.

> correctness

fwiw, cpu quota and effective available cores aren’t completely orthogonal.

> I think if we don’t take this change now, we never will.

We can do this in .NET 5. And include the necessary additional properties/configuration flags/…

> we have been planning a container density investigation.

👍 👍

I’m glad we got the ProcessorCount changes figured out, and can keep 3.1 the same as previous versions. Good luck hunting for the 2.x/3.0 regression.

Let’s consider 3 aspects which are important: correctness, performance, and backwards-compatibility.

This change is not backwards compatible.

Performance has regressed, which is why additional changes are being proposed in .NET Core. Code outside .NET Core is affected too. For example, the ASP.NET Libuv transport uses ProcessorCount in a similar way, as it affects the epoll threads and the threadpool. An additional change is now being proposed for the number of epoll threads; I previously made a PR to set this to 1, but it was not merged because performance was worse than at ProcessorCount.

Overall we don’t know if performance has now improved or regressed. The main driver for pushing the change seems to be correctness. Imo performance and backwards-compatibility are much more important than correctness.

Such an option would give us the ability to switch between what is IMHO legacy netfx behavior (present in netcore <3.1), where the CLR assumes it owns the machine, which was fine in the dedicated server/VM era, and the new behavior that is more accurate/true and suitable for containers/shared machines.

3.0 is taking into account cpu quota instead of using the number of cores of the physical machine. This makes .NET Core container aware. I don’t understand why you call this ‘legacy netfx’ behavior.

Can you elaborate on this? Or how is this different from the <ConservativeProcessorCount>true</ConservativeProcessorCount> suggestion?

@mrmartan is proposing to keep the default behavior as 3.0 and add a way to directly control the value returned by Environment.ProcessorCount (e.g. an envvar). It allows users to experiment, and to figure out the ‘best’ value for ProcessorCount.

I think that makes sense as a 3.1 config knob.

My main concern is about the changed default behavior.

I think it would be good to introduce a new runtimeconfig option that tells the runtime whether it should treat the quota as a limit on processor count or not. Without input from the developer, the runtime has no way of knowing what the intended behavior is, and regardless of the implemented policy there will always be 50% of apps that work great and 50% that do not. @janvorli, @jkotas, @VSadov, what do you think?
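Such a knob could plausibly take the shape of a runtimeconfig.json property, e.g. (the property name below is purely hypothetical; no such setting exists today):

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.Runtime.TreatCpuQuotaAsProcessorCountLimit": true
    }
  }
}
```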

My main worry right now is memory allocation - as this seems to affect the number of heaps on Server GC (https://docs.microsoft.com/en-us/dotnet/core/run-time-config/garbage-collector#systemgcheapcountcomplus_gcheapcount).

It is a massive jump to go from 1 core to 112 cores, and while we can still configure this with the COMPlus_GCHeapCount flag, it’s quite surprising that this would be needed at all. If this is going to be a permanent change, I think a breaking-change announcement would be nice to have, and at least some documentation on how to control the behavior, especially because on Kubernetes it can be tricky (or impossible; I haven’t fully understood their docs yet on this) to set the --cpuset-cpus option to get the former expected behavior.

Had to sadly roll back to netcore3.0 for now till we’ve some clarity here 😦

Agreed. But that’s just part of it, and to some degree the easy part. The crux of our challenge and investigation has been determining what the best default behavior should be for the runtime, and to a lesser degree the class library, in this mode.

To me it looks like you’re trying to find out what needs to change in the runtime to compensate for ProcessorCount no longer taking into account cpu quota. Ideally, runtime and class library use the same ‘recommended level of parallelism’.

The importance of taking into account cpu quota is because it does have an effect on parallelism. Especially on containers with low cpu quota, which is a common configuration in all production Kubernetes deployments I’ve seen.

For 3.1, it makes sense to revert https://github.com/dotnet/coreclr/pull/26153 imo. There were no issues reported for it, and it has worked well for 2.x and 3.0.

For .NET 5 we can see if we can relax the impact of cpu quota and how that improves performance. And maybe add some new APIs.

@richlander existing code is using ProcessorCount as an indicator for parallelism. That’s why we see these performance regressions.

Our internal investigation has not been able to demonstrate this. It’s the obvious conclusion, but we are yet to prove that.

Current status:

  • There are two main problems that have been observed: high memory costs due to fixed per GC-heap costs, throughput regressions.
  • For reducing memory costs, we recommend configuring the GC by setting COMPlus_GCHeapCount to a low value, likely matching the CPU quota, or use workstation GC by setting COMPlus_gcserver to 0. FYI: This is not our long-term plan.
  • For the throughput regression, we are planning to update the threadpool to align min threads with the CPU quota. We have seen that this change reduces most of the regression (for our tests). That same fix is in the build @ https://hub.docker.com/r/richlander/aspnet. Anyone is free to test that build to validate the regression reduction we’ve seen.
  • We will treat this topic area as a high priority for .NET 5 and may back-port improvements to .NET Core 3.1 if the cost/benefit/risk trade-offs are good.
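For concreteness, those mitigations can be passed as environment variables on the container. A sketch (the image name myapp:3.1 is a placeholder; COMPlus_GCHeapCount and COMPlus_gcServer are the documented GC knobs):

```shell
# Cap server GC heaps to roughly the CPU quota...
docker run --rm --cpus=1 -e COMPlus_GCHeapCount=1 myapp:3.1

# ...or opt in to workstation GC instead.
docker run --rm --cpus=1 -e COMPlus_gcServer=0 myapp:3.1
```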

I know people are passionate about this. Our challenge is that we have not seen data that clearly sets a direction. If people want to work closely and/or privately with us with real world apps, we are interested and motivated. We’re also not closing the door to any solution, but do require good data to make changes. We also took a hard run at epoll threads. @stephentoub then reminded us that you had already done that @ https://github.com/dotnet/corefx/pull/36693.

I know that the ProcessorCount change was a break, but it’s only a break since 3.0 so it isn’t particularly convincing in-and-of itself.

We DEFINITELY want to do the RIGHT THING. We’re trying very hard to determine what that is.

ProcessorCount should not be affected by quota. It can be affected by changing affinity. Just to rule that out: is it possible that an affinity mask is being set in your scenario?

Another idea is to add an opt-in experience along the lines of: <ConservativeProcessorCount>true</ConservativeProcessorCount>

Anyone think that’s a good idea?

I’m still sticking to the idea that the current 3.1 behavior is the best default. Basically, someone is going to be unhappy either way, so I’d rather go with the most accurate behavior/value. Fair?

@VSadov, you said “there is generally a way to limit processor count”. What is it? I think that "--cpuset-cpus" is not really suitable for this purpose because it sets hard affinity and limits the ability of the OS scheduler to schedule threads efficiently.