xgboost: xgboost.dask.train (cpu) high memory usage when increasing max_depth

We are trying to understand the memory usage of xgboost.dask.train and what the expected behavior is as max_depth increases.

Running a LocalCluster on my laptop (Mac M1, 16GB memory). dask version = 2023.6.0, xgboost version = 1.7.6.

Note: I tried running max_depth=20 and it killed my laptop.

Reproducer (fabricated example to showcase the problem)

import dask
import xgboost
from dask.distributed import Client
from distributed.diagnostics import MemorySampler

client = Client()
ms = MemorySampler()

dtypes = {}

for i in range(40):
    dtypes["f-%d" % i] = float
    
for i in range(40):
    dtypes["i-%d" % i] = int
    
df = dask.datasets.timeseries(
    start="2000-01-01",
    end="2001-01-01",
    dtypes=dtypes,
    partition_freq="4d",
    freq="10s",
)

df = df.persist()  # this is a 1.9GiB dataset 
X = df.drop("i-1", axis=1)
y = df["i-1"]

d_train = xgboost.dask.DaskDMatrix(
    None, X, y, enable_categorical=True
)


# This samples the curve for max_depth=18; run it with different values of max_depth to reproduce the plot below.
with ms.sample("md_18"):
    model = xgboost.dask.train(
        None,
        {
            "max_depth": 18,
        },
        d_train,
        evals=[(d_train, "train")],
    )
    
ms.plot(align=True, grid=True)   
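
For completeness, a minimal sketch (not part of the original report) of how curves for several max_depth values could be collected with the same MemorySampler and DaskDMatrix; the label names are arbitrary:

# Sample one memory-usage curve per max_depth value, reusing the same DaskDMatrix.
for depth in (6, 12, 18):
    with ms.sample("md_%d" % depth):
        xgboost.dask.train(
            None,
            {"max_depth": depth},
            d_train,
            evals=[(d_train, "train")],
        )

# Overlay all sampled curves on a single plot for comparison.
ms.plot(align=True, grid=True)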

[Plot: memory usage curves for max_depth = 6, 12, 18 (md_6-12-18)]

We noticed that if we include num_parallel_tree the behavior is different, but it’s not clear to us why. Note that the max_depth 18 and 20 cases were taking too long to finish, so we cut them off (they were over 10 minutes into the 7th iteration).

    model = xgboost.dask.train(
        None,
        {
            "max_depth": 18,
            "num_parallel_tree": 100,
        },
        d_train,
        evals=[(d_train, "train")],
    )

[Plot: memory usage with num_parallel_tree = 100 for max_depth = 12, 18, 20 (md_12-18-20_npt100)]
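
One way to bound the longer-running configurations instead of cutting them off by hand would be to lower the number of boosting rounds; a minimal sketch, assuming the same d_train as above, with num_boost_round chosen arbitrarily (the default is 10):

model = xgboost.dask.train(
    None,
    {
        "max_depth": 18,
        "num_parallel_tree": 100,
    },
    d_train,
    num_boost_round=5,  # fewer boosting rounds, only to keep the comparison run time bounded
    evals=[(d_train, "train")],
)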

About this issue

  • State: closed
  • Created a year ago
  • Comments: 20 (20 by maintainers)

Most upvoted comments

I’m working on it, among other things.

Will 2.0 support it, and when can we try it in the master version?

I don’t have an ETA. We need to redesign the cache for the parallel histogram builder. It should be possible for 2.0.