scikit-learn: PyPy tests timeouts / memory usage investigation
EDIT: one of the main causes of the problem described below has already been fixed by #27670. However, despite this improvement, significant memory problems remain when running the scikit-learn test suite on PyPy. Similar investigation and fixes are needed to iteratively tackle the next worst offenders, until the tests can run with an amount of memory comparable to what we observe with CPython (instead of roughly a factor of 10 more).
Original description:
I had a closer look at the PyPy tests, which have been timing out for a while; here is the result of my investigation. It may also help in the future to have a central issue for this, rather than the discussion being split across the automatically created issues in this repo and in scikit-learn-feedstock.
Running the PyPy tests locally needs ~11GB on my machine, whereas it is ~1.2GB with CPython. I ran them without pytest-xdist to simplify things.
(memory usage plots over the full test run: PyPy vs CPython)
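For context, a minimal sketch of one way such a memory profile can be recorded, by polling the resident memory of the pytest process over time (this is not necessarily how the plots above were generated, and the pytest target path is illustrative):

```python
import subprocess
import time

import psutil

# Launch the test run and poll its resident memory every 10 seconds.
proc = subprocess.Popen(["python", "-m", "pytest", "sklearn/linear_model"])
ps_proc = psutil.Process(proc.pid)

while proc.poll() is None:
    try:
        rss = ps_proc.memory_info().rss / 1e9
    except psutil.NoSuchProcess:
        break  # the process exited between poll() and memory_info()
    print(f"{time.strftime('%H:%M:%S')} rss: {rss:.3f}GB")
    time.sleep(10)
```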
It seems that one of the places where the memory usage grows over time is the linear_model tests (3.4GB with PyPy vs 200MB with CPython locally).
(memory usage plots for the linear_model tests: PyPy vs CPython)
I managed to reproduce the issue (memory growing far more than on CPython) with the following snippet, where one of our Cython loss functions is called many times in a tight loop:
```python
import gc
import platform

import numpy as np
import psutil

from sklearn._loss.loss import HalfGammaLoss

IS_PYPY = platform.python_implementation() == "PyPy"


def func(data):
    # Call one of the scikit-learn Cython loss functions in a tight loop.
    loss = HalfGammaLoss()
    for i in range(10_000):
        loss(data, data)


def main():
    for i in range(101):
        if i % 10 == 0:
            memory_usage = psutil.Process().memory_info().rss / 1e9
            message = f'{i} psutil: {memory_usage:.3f}GB'
            if IS_PYPY:
                # Total memory allocated so far, as reported by PyPy's GC.
                pypy_memory_usage = gc.get_stats().memory_allocated_sum
                message += f' , pypy total allocated: {pypy_memory_usage}'
            print(message)
        n_samples, n_features = 4, 12
        rng = np.random.RandomState(0)
        raw_prediction = rng.uniform(low=-3, high=3, size=n_samples)
        func(raw_prediction)

    if IS_PYPY:
        print(gc.get_stats())


if __name__ == '__main__':
    main()
```
The output shows that the memory usage grows over time, and that the RSS ends up at roughly 7 times the total allocated memory reported by PyPy (3.663GB vs 508.4MB at the last iteration):
```
0 psutil: 0.175GB , pypy total allocated: 69.2MB
10 psutil: 0.537GB , pypy total allocated: 113.9MB
20 psutil: 0.888GB , pypy total allocated: 157.4MB
30 psutil: 1.257GB , pypy total allocated: 201.4MB
40 psutil: 1.595GB , pypy total allocated: 247.4MB
50 psutil: 1.931GB , pypy total allocated: 291.9MB
60 psutil: 2.325GB , pypy total allocated: 330.4MB
70 psutil: 2.659GB , pypy total allocated: 374.4MB
80 psutil: 3.003GB , pypy total allocated: 423.9MB
90 psutil: 3.327GB , pypy total allocated: 460.9MB
100 psutil: 3.663GB , pypy total allocated: 508.4MB
```
I am not entirely sure whether this is a red herring, an issue with our Cython loss implementation, or an issue in the interaction between Cython and PyPy. Maybe @mattip has some insights about this?
On CPython, the memory usage is stable at around 100MB:
```
0 psutil: 0.097GB
10 psutil: 0.098GB
20 psutil: 0.098GB
30 psutil: 0.098GB
40 psutil: 0.098GB
50 psutil: 0.098GB
60 psutil: 0.098GB
70 psutil: 0.098GB
80 psutil: 0.098GB
90 psutil: 0.098GB
100 psutil: 0.098GB
```
My environment, for good measure:

```
$ mamba list
# packages in environment at /home/lesteve/micromamba/envs/pypy:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
bzip2 1.0.8 h7f98852_4 conda-forge
ca-certificates 2023.7.22 hbcca054_0 conda-forge
colorama 0.4.6 pyhd8ed1ab_0 conda-forge
cython 3.0.3 py39hc10206b_0 conda-forge
exceptiongroup 1.1.3 pyhd8ed1ab_0 conda-forge
execnet 2.0.2 pyhd8ed1ab_0 conda-forge
expat 2.5.0 hcb278e6_1 conda-forge
gdbm 1.18 h0a1914f_2 conda-forge
iniconfig 2.0.0 pyhd8ed1ab_0 conda-forge
joblib 1.3.2 pyhd8ed1ab_0 conda-forge
libblas 3.9.0 19_linux64_openblas conda-forge
libcblas 3.9.0 19_linux64_openblas conda-forge
libexpat 2.5.0 hcb278e6_1 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc-ng 13.2.0 h807b86a_2 conda-forge
libgfortran-ng 13.2.0 h69a702a_2 conda-forge
libgfortran5 13.2.0 ha4646dd_2 conda-forge
libgomp 13.2.0 h807b86a_2 conda-forge
liblapack 3.9.0 19_linux64_openblas conda-forge
libopenblas 0.3.24 pthreads_h413a1c8_0 conda-forge
libsqlite 3.43.2 h2797004_0 conda-forge
libstdcxx-ng 13.2.0 h7e041cc_2 conda-forge
libxcb 1.15 h0b41bf4_0 conda-forge
libzlib 1.2.13 hd590300_5 conda-forge
ncurses 6.4 hcb278e6_0 conda-forge
numpy 1.26.0 py39h6dedee3_0 conda-forge
openssl 3.1.4 hd590300_0 conda-forge
packaging 23.2 pyhd8ed1ab_0 conda-forge
pip 23.3 pyhd8ed1ab_0 conda-forge
pluggy 1.3.0 pyhd8ed1ab_0 conda-forge
psutil 5.9.5 py39hf860d4a_1 conda-forge
pthread-stubs 0.4 h36c2ea0_1001 conda-forge
pypy 7.3.13 0_pypy39 conda-forge
pypy3.9 7.3.13 h9557127_0 conda-forge
pytest 7.4.2 pyhd8ed1ab_0 conda-forge
pytest-repeat 0.9.2 pyhd8ed1ab_0 conda-forge
pytest-xdist 3.3.1 pyhd8ed1ab_0 conda-forge
python 3.9.18 0_73_pypy conda-forge
python_abi 3.9 4_pypy39_pp73 conda-forge
readline 8.2 h8228510_1 conda-forge
scipy 1.11.3 py39h6dedee3_1 conda-forge
setuptools 68.2.2 pyhd8ed1ab_0 conda-forge
sqlite 3.43.2 h2c6b66d_0 conda-forge
threadpoolctl 3.2.0 pyha21a80b_0 conda-forge
tk 8.6.13 h2797004_0 conda-forge
tomli 2.0.1 pyhd8ed1ab_0 conda-forge
tzdata 2023c h71feb2d_0 conda-forge
wheel 0.41.2 pyhd8ed1ab_0 conda-forge
xorg-kbproto 1.0.7 h7f98852_1002 conda-forge
xorg-libx11 1.8.7 h8ee46fc_0 conda-forge
xorg-libxau 1.0.11 hd590300_0 conda-forge
xorg-libxdmcp 1.1.3 h7f98852_0 conda-forge
xorg-xextproto 7.3.0 h0b41bf4_1003 conda-forge
xorg-xproto 7.0.31 h7f98852_1007 conda-forge
xz 5.2.6 h166bdaf_0 conda-forge
zlib 1.2.13 hd590300_5 conda-forge
```
I found my Linux machine and could try @lesteve's script with the above changes. Interestingly, they remove the memory leak.
So we should probably look at the `np.asarray` calls that we use to convert memoryviews back into arrays, since something is going on there.

Thanks for the minimal reproducer. I opened a PyPy issue.
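For illustration, here is a pure-Python sketch of the object chain this pattern creates (the actual scikit-learn code goes through Cython typed memoryviews, where the intermediate object is Cython's `_memoryviewslice` rather than a plain `memoryview`):

```python
import numpy as np

original = np.zeros(10_000)   # buffer owner
view = memoryview(original)   # stand-in for a Cython typed memoryview
wrapped = np.asarray(view)    # new ndarray wrapping the memoryview, no copy

# The new array keeps the memoryview alive, which keeps the original
# buffer owner alive:
print(type(wrapped.base))  # <class 'memoryview'>

# On CPython, reference counting frees the whole chain as soon as `wrapped`
# is dropped; with a tracing GC like PyPy's, each link in the chain is
# presumably only reclaimed at a later collection, so the buffers pile up.
```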
Thanks for the input! I said “timed out”, which is not very accurate: I am reasonably confident the tests get killed because they use too much memory.
I would keep this one open since I may have another look in the future. Having said that, this is not very high priority, since the PyPy tests time out much more rarely now that #27670 has been merged.
A quick look at #27750 seems to indicate that the PyPy tests timed out 5 times since November 8 (i.e. in 23 days or so), so the situation is a lot better than before, when they were always timing out.
Saving the reproducer as `test_pypy.pyx` and running `cythonize test_pypy.pyx` results in a C extension module. I can use it to recreate the problem.

So Cython has a class `_memoryviewslice` which has a `__dealloc__` method and hangs on to the original object, which is wrapped in a memoryview, which is wrapped in an ndarray. This is going to create GC cycles between objects in a way that I am not sure even the `cpyext-gc-cycle` branch will be able to untangle.

So maybe we can wait before attempting to fix the 37 occurrences of this pattern in scikit-learn. Please let us know, @mattip, if this is too complex to fix on the PyPy/Cython side: we could then change scikit-learn to avoid this pattern in the first place; a sketch of what that could look like is below.
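For the record, a minimal sketch of avoiding the pattern, assuming the Cython helpers can be made to fill a caller-provided buffer (`fill_values` below is a hypothetical stand-in, implemented here in pure Python):

```python
import numpy as np


def fill_values(out):
    # Hypothetical stand-in for a Cython helper that writes its results into
    # a typed memoryview of `out`; here it just fills in dummy values.
    out[:] = 1.0


def compute_loss_values(n_samples):
    out = np.empty(n_samples)  # allocate the ndarray up front
    fill_values(out)           # fill it in place
    return out                 # return the ndarray itself: no memoryview is
                               # converted back with np.asarray


print(compute_loss_values(4))
```

The point of the design is that the ndarray is created first and stays the only Python-visible owner of the buffer, so no ndarray-to-memoryview-to-ndarray chain is ever built.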
Great news. A smaller reproducer would be welcome, although I imagine it hits the C/Python mutual reference cycle problem I mentioned above. At least I could add it to the PyPy FAQ of “things to avoid”.