duckdb: The default number of threads is wrong within a distributed environment

What happens?

When using duckdb from a Docker container in a Kubernetes cluster, the default number of threads is the number of physical cores of the host machine, instead of the number of cores accessible to the container. This results in oversubscription: far more threads are spawned than there are cores available to run them.

This is probably due to duckdb using shell commands like nproc to obtain the number of cores. Such commands are not cgroup-aware and therefore yield incorrect results inside a container.

On the other hand, libraries like joblib inspect cgroups to give an accurate number of accessible cores. See this section of joblib.

cc @ogrisel, who spotted the bug

To Reproduce

Run duckdb in a Kubernetes pod and see the number of spawned threads with htop.

import duckdb

con = duckdb.connect("duckdb.db")
# any SQL query, for example:
con.execute("SELECT 42").fetchall()
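As an alternative to htop, the number of OS-level threads of the Python process can be read from /proc. The following is a minimal, Linux-only sketch; the helper names parse_thread_count and os_thread_count are made up for illustration and are not part of duckdb.

```python
from pathlib import Path


def parse_thread_count(status_text: str) -> int:
    """Extract the 'Threads:' field from the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no 'Threads:' field found")


def os_thread_count(pid: str = "self") -> int:
    """Count the OS-level threads of a process (Linux only, reads /proc)."""
    return parse_thread_count(Path(f"/proc/{pid}/status").read_text())
```

Calling os_thread_count() before and after duckdb.connect(...) in the same process should give a rough idea of how many threads were spawned.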

Also run nproc in the shell (this will also give the number of observed threads).

Then, install joblib:

pip install joblib
python -c "import joblib; print(joblib.cpu_count())"

This will give the number of accessible cores. Alternatively, query the cgroup attributes directly, as joblib does:

expr $(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us) / $(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
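The shell command above reads the cgroup v1 files, and expr truncates the result to an integer. A rough Python equivalent is sketched below. It is not joblib's actual implementation: the function name cgroup_cpu_quota is hypothetical, and the cgroup v2 branch assumes the unified hierarchy is mounted at /sys/fs/cgroup with a cpu.max file.

```python
from pathlib import Path
from typing import Optional


def cgroup_cpu_quota(root: str = "/sys/fs/cgroup") -> Optional[float]:
    """Return the CPU quota in cores (e.g. 2.5), or None if no limit is set."""
    base = Path(root)
    try:
        cpu_max = base / "cpu.max"
        if cpu_max.exists():
            # cgroup v2: one file containing "<quota> <period>" or "max <period>"
            quota_s, period_s = cpu_max.read_text().split()
            if quota_s == "max":
                return None
            quota, period = int(quota_s), int(period_s)
        else:
            # cgroup v1: the two files used in the shell command above
            quota = int((base / "cpu" / "cpu.cfs_quota_us").read_text())
            period = int((base / "cpu" / "cpu.cfs_period_us").read_text())
    except (OSError, ValueError):
        # Files missing or unreadable: treat as "no limit known"
        return None
    if quota <= 0 or period <= 0:
        # v1 reports a quota of -1 when the container is unlimited
        return None
    return quota / period
```

The function returns a fractional value on purpose, so that the caller decides how to round it into a thread count.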

OS:

Ubuntu 20.04.4 LTS

DuckDB Version:

0.7.0

DuckDB Client:

Python 3.10

Full Name:

Vincent Maladiere

Affiliation:

Inria • scikit-learn

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

About this issue

  • State: closed
  • Created a year ago
  • Comments: 17 (8 by maintainers)

Most upvoted comments

There is a ceil on the result of the quota/period division, but it is missing a lower bound of 1, as you experienced.
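To make the rounding concrete, here is a minimal sketch of a ceil with a lower bound of 1. It is an illustration under assumptions, not the code discussed in this thread: the function name threads_from_quota is hypothetical, and treating a non-positive quota as "no limit" is a choice made for this example.

```python
import math


def threads_from_quota(quota_us: int, period_us: int, host_cpus: int) -> int:
    """Turn a CFS quota/period pair into a thread count that is never below 1."""
    if quota_us <= 0 or period_us <= 0:
        # No usable limit (cgroup v1 reports -1 for "unlimited"): use the host count
        limit = host_cpus
    else:
        # Round up so that a fractional quota such as 2.5 cores yields 3 threads
        limit = math.ceil(quota_us / period_us)
    # Never exceed the host count, and never go below one thread
    return max(1, min(limit, host_cpus))
```

With this sketch, a quota of 250000 over a period of 100000 on a 64-core host gives 3 threads, and a quota of 50000 over the same period gives 1.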

Alternatively, there might be a way to ask duckdb directly how many threads it wants to use, but I am not aware of how to inspect this. We have a way to inspect settings, like threads:

select current_setting('threads');
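From Python, the same setting can be read and overridden through a connection. The sketch below assumes con is an object returned by duckdb.connect(...); the helper names current_threads and set_threads are made up for this example, and SET threads TO ... is used as the override.

```python
def current_threads(con) -> int:
    """Read the thread count from a DuckDB connection via current_setting."""
    row = con.execute("SELECT current_setting('threads')").fetchone()
    return int(row[0])


def set_threads(con, n_threads: int) -> None:
    """Override DuckDB's default thread count, rejecting values below 1."""
    n = int(n_threads)
    if n < 1:
        raise ValueError("n_threads must be at least 1")
    # n is a validated integer, so formatting it into the statement is safe
    con.execute(f"SET threads TO {n}")
```

For example, after con = duckdb.connect("duckdb.db"), calling set_threads(con, 2) and then current_threads(con) should report 2. In a container, the value passed in would typically come from a cgroup-aware count such as joblib.cpu_count().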