oneTBB: Oversubscription slowdown when run under container CPU quotas

When running an application in a Linux container environment (e.g. Docker containers), the orchestrator configuration (Kubernetes, Docker Compose/Swarm) often sets CPU quotas via Linux cgroups to prevent one containerized app from starving the neighboring apps running on the same host of CPU.

However, TBB does not seem to introspect /sys/fs/cgroup/cpu/cpu.cfs_quota_us / /sys/fs/cgroup/cpu/cpu.cfs_period_us to figure out how many tasks it can run concurrently, which results in a significant slowdown caused by over-subscription. As there is no TBB_NUM_THREADS-style environment variable that could be set in the container deployment configuration, this makes it challenging to deploy TBB-enabled apps efficiently on Docker-managed servers.
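For illustration, below is a minimal Python sketch of the kind of introspection this would require; the cfs_cpu_budget helper is hypothetical, the paths are the cgroup v1 files mentioned above, and a cgroup v2 host would expose the same information in /sys/fs/cgroup/cpu.max instead:

import os

def cfs_cpu_budget(base="/sys/fs/cgroup/cpu"):
    # Hypothetical helper: return the CPU budget implied by the CFS quota,
    # or None when no quota is configured (cpu.cfs_quota_us contains -1).
    try:
        with open(os.path.join(base, "cpu.cfs_quota_us")) as f:
            quota_us = int(f.read())
        with open(os.path.join(base, "cpu.cfs_period_us")) as f:
            period_us = int(f.read())
    except OSError:
        return None  # files absent, e.g. no quota or a cgroup v2 layout
    if quota_us <= 0:
        return None
    return quota_us / period_us

print("os.cpu_count():", os.cpu_count())    # what TBB defaults to (48 here)
print("CFS CPU budget:", cfs_cpu_budget())  # e.g. 2.0 inside `docker run --cpus 2`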

Here is a reproducing setup using numpy from the default Anaconda channel on a host machine with 48 threads (24 physical cores):

$ cat oversubscribe_tbb.py
import numpy as np
from time import time

data = np.random.randn(1000, 1000)
print(f"Calling np.linalg.eig shape={data.shape}:",
      end=" ", flush=True)
tic = time()
np.linalg.eig(data)
print(f"{time() - tic:.3f}s")

$ docker run --cpus 2 -ti -v `pwd`:/io continuumio/miniconda3 bash
(base) # conda install -y numpy tbb
(base) # MKL_THREADING_LAYER=tbb python /io/oversubscribe_tbb.py     
one eig, shape=(1000, 1000): 20.227s

With sequential execution, or with OpenMP and an appropriately configured environment, the problem disappears:

(base) # MKL_THREADING_LAYER=sequential python /io/oversubscribe_tbb.py
Calling np.linalg.eig shape=(1000, 1000): 1.636s
(base) # MKL_THREADING_LAYER=omp OMP_NUM_THREADS=2 python /io/oversubscribe_tbb.py
Calling np.linalg.eig shape=(1000, 1000): 1.484s

Of course, if OpenMP is used without setting OMP_NUM_THREADS to match the Docker CPU quota, one also gets an over-subscription problem similar to the one encountered with TBB:

(base) # MKL_THREADING_LAYER=omp python /io/oversubscribe_tbb.py     
Calling np.linalg.eig shape=(1000, 1000): 22.703s

Edit: the first version of this report mentioned MKL_THREADING_LAYER=omp instead of MKL_THREADING_LAYER=tbb in the first command (with duration 20.227s). I confirm that we also get 20s+ with MKL_THREADING_LAYER=tbb.
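Until TBB becomes quota-aware, one possible workaround is to derive the thread count from the cgroup files inside the container and export it before the threaded library initializes. The sketch below is only an assumption-laden illustration: it assumes the cgroup v1 paths from above and MKL's OpenMP layer, which honors OMP_NUM_THREADS when the OpenMP runtime starts; it is not existing TBB behaviour.

import math
import os
from pathlib import Path

# Read the CFS quota (cgroup v1 paths assumed); -1 means no quota is set.
try:
    quota_us = int(Path("/sys/fs/cgroup/cpu/cpu.cfs_quota_us").read_text())
    period_us = int(Path("/sys/fs/cgroup/cpu/cpu.cfs_period_us").read_text())
except OSError:
    quota_us, period_us = -1, 1

if quota_us > 0:
    n_threads = max(1, math.ceil(quota_us / period_us))
    # Must be set before the OpenMP runtime is initialized by the first MKL call.
    os.environ.setdefault("OMP_NUM_THREADS", str(n_threads))

import numpy as np  # imported only after the environment is adjusted

data = np.random.randn(1000, 1000)
np.linalg.eig(data)  # now limited to roughly quota/period threads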

About this issue

  • State: open
  • Created 5 years ago
  • Reactions: 1
  • Comments: 18 (4 by maintainers)

Most upvoted comments

Hi, @alexey-katranov.

In this context, my suggestion would be to interpret the CPU quota/period ratio to decide the default number of threads. If you read a quota of two times the period, that indicates you can use up to 2 CPUs' worth of runtime per period, which maps to two threads assuming they are busy 100% of the time. Non-integer ratios could be rounded up to the next integer.
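A minimal Python sketch of that suggested default (the default_num_threads helper is purely illustrative, not existing TBB code):

import math
import os

def default_num_threads(quota_us, period_us):
    # ceil(cfs_quota / cfs_period), falling back to the logical core count
    # when no quota is configured (cpu.cfs_quota_us contains -1).
    if quota_us <= 0:
        return os.cpu_count()
    return max(1, math.ceil(quota_us / period_us))

print(default_num_threads(-1, 100_000))       # no quota -> all logical cores
print(default_num_threads(800_000, 200_000))  # ratio 4 -> 4 threads
print(default_num_threads(150_000, 100_000))  # ratio 1.5 -> rounded up to 2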

A common example nowadays: you have 2 containers sharing a node with 8 cores, both running multithreaded workloads. These containers aren't pinned to particular CPUs but are assigned a fraction of the machine's bandwidth using CFS Bandwidth Control (for instance, by launching the containers with Docker's --cpus option).

The scheduler implements this by running all threads of a control group for a fraction (cfs_quota) of an execution period (cfs_period); once they have used up the quota, it stops them, and they remain stopped until the start of the next period. The problem arises when, in multithreaded workloads, the quota/period ratio is lower than the number of logical cores (TBB's default). For instance, if on this 8-core machine the cfs_quota is 800ms and the period is 200ms, we get a quota/period ratio of 4, meaning we can use up to 4 CPUs' worth of runtime. To use this quota efficiently, it would be great to spawn 4 threads in each container, but TBB will spawn 8. In this case, each thread will run on a different CPU, but only for 50% of an execution period, since we can only use up to 4 CPUs' worth of runtime. After half the period, all 8 threads are yanked out of execution by the operating system and made to wait until they can run again in the next period. These context switches turn out to be very costly.

Just scale the above example up to a machine with 100 logical cores where each container is assigned two CPUs' worth of quota. Now 100 threads are allowed to use only as much CPU time as two threads running at full speed. That means they constantly have to be switched out and back in, and they spend 98% of each cfs_period waiting. That is not even counting the overhead of context-switching 100 threads.

Hi, we ran into the same problem several weeks ago.

Sadly, the IPC solution described in the PR referenced above is not an option for us in terms of performance, and we had to fall back to reading the cgroup files ourselves (https://github.com/root-project/root/blob/a7495ae4f697f9bf285835f004af3f14f330b0eb/core/imt/src/TPoolManager.cxx#L32).

However, we find it hard to believe we are the only ones running into this problem when the virtualization of hardware resources is so widespread nowadays. Is this something the TBB team is thinking of addressing in the near future? Do you see it as something to be handled on the user side, or do you agree that TBB should take care of it?

For reference, loky (an alternative to concurrent.futures from the Python standard library) is CPU-quota aware, using the following logic:

https://github.com/joblib/loky/blob/f15594a44420abfa9b398be7ff3c9180c2858bf4/loky/backend/context.py#L181-L195