wandb: OSError Too many open files: /tmp/tmphv67gzd0wandb-media
I have been using YOLOv5's wandb integration and it is giving me this error:
File "/opt/conda/lib/python3.8/shutil.py", line 712, in rmtree
OSError: [Errno 24] Too many open files: '/tmp/tmphv67gzd0wandb-media'
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/weakref.py", line 642, in _exitfunc
File "/opt/conda/lib/python3.8/weakref.py", line 566, in __call__
File "/opt/conda/lib/python3.8/tempfile.py", line 817, in _cleanup
File "/opt/conda/lib/python3.8/tempfile.py", line 813, in _rmtree
File "/opt/conda/lib/python3.8/shutil.py", line 714, in rmtree
File "/opt/conda/lib/python3.8/shutil.py", line 712, in rmtree
OSError: [Errno 24] Too many open files: '/tmp/tmphv67gzd0wandb-media'
This happens while running the genetic algorithm for hyperparameter evolution. Any idea why wandb is doing this?
About this issue
- State: open
- Created 3 years ago
- Reactions: 4
- Comments: 39 (8 by maintainers)
For me, this issue arose when I was logging a large number of artifacts (in my case, even just a single wandb.Table, but at every time step for a few thousand steps) on a cluster with a low ulimit -n (4096 in my case). I fixed it by changing my code to not log artifacts, which means I lost much of the value of wandb in the first place, but at least my experiments could run to completion without stalling (a sketch of that kind of change follows below).

If logging many artifacts is the main trigger for this issue, then my guess is that the wandb team made a naive assumption about how users would use their tools: they built software that opens, and keeps open, at least one file for every artifact logged, and that design did not generalize to real-world use cases. Especially with the new NLP tools, I hope this issue gets more attention, as it is really useful to be able to log a batch of tabular data at each step.
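To make the "don't log artifacts" workaround concrete, here is a minimal sketch, assuming a plain wandb.init / wandb.log training loop (the project and metric names are placeholders, not from the original code):

import wandb

run = wandb.init(project="no-artifact-logging")  # placeholder project name

for step in range(1000):
    # Plain scalar logging does not create per-step artifact/media files,
    # so no extra file descriptors are held open as training progresses.
    wandb.log({"loss": 1.0 / (step + 1)}, step=step)
    # Avoided: wandb.log({"preds": wandb.Table(...)}, step=step) at every step,
    # which is the pattern that ran into the open-file limit.

run.finish()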
@MBakirWB @shawnlewis Here's some simple code to repro this (sketched below). Run ulimit -n 64 beforehand to artificially set the file limit low for testing, though from experience I expect this to also happen with 1024 or even 4096.

Around step 186 I start to get the Too many open files: error, then in the early 200s I start to see weirder tracebacks and errors. Finally, it hangs after Logged "Step 999" rather than finishing the run and exiting. When I check the web UI, the logs and the view of the table seem to have stopped after Step 21, not even reaching the first CLI-visible error at step 186.
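The exact repro script isn't shown here, but a minimal sketch of the pattern described, using only the public wandb.init / wandb.log / wandb.Table calls (project and column names are placeholders), looks roughly like this:

import wandb

run = wandb.init(project="fd-leak-repro")  # placeholder project name

for step in range(1000):
    # Logging a table at every step creates a media/artifact file each time,
    # which is what eventually exhausts the open-file limit.
    table = wandb.Table(columns=["step", "value"], data=[[step, step * 0.1]])
    wandb.log({"my_table": table}, step=step)
    print(f'Logged "Step {step}"')

run.finish()

Run ulimit -n 64 in the shell first to hit the error quickly, as described above.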
Run on Ubuntu via Windows Subsystem for Linux with the following uname -a (though first encountered on a Linux SLURM cluster at my university).

@maxzw, thank you for writing in. This bug is still being addressed. We will update the community here once a fix has been implemented. Regards
I am still experiencing this 😦 This has to be reopened.
@pvtien96 Do you think you will be able to provide a small repro example to help us further debug this?
I have also hit this issue - lost 12 long-running experiments, brutal!
I had this issue during a sweep. My hacky fix is to restart the sweep agent after every run and bump ulimit -n to 4096 from 1024 (an in-process version is sketched below). Hopefully this will work… this is a super annoying bug.

In your code, avoid logging artifacts if you can (e.g. don't do wandb.log({... wandb.Table(...)}) or other things that create artifacts).

I hope some wandb people are looking at this. It seems pretty bad and renders wandb close to unusable for researchers on shared compute clusters who need to log many artifacts (especially for NLP or CV).
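For what it's worth, the ulimit bump can also be done from inside the training script itself; this is just a sketch using Python's standard resource module (Linux/macOS only), not anything wandb-specific:

import resource

# Raise this process's soft limit on open file descriptors: the in-process
# equivalent of running `ulimit -n 4096` in the shell before launching.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 4096
if hard != resource.RLIM_INFINITY:
    target = min(target, hard)  # the soft limit may not exceed the hard limit
resource.setrlimit(resource.RLIMIT_NOFILE, (max(soft, target), hard))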
@jxmorris12 Unfortunately, that didn't help. Thanks anyway for posting; maybe it works for others.
@jonasjuerss This workaround worked for me (so far):
Is there any workaround? I am running jobs on SLURM, and at some point wandb just stops logging because of this. I stopped the run via wandb, hoping it would sync after canceling, but it's just stuck in "stopping" forever. I really rely on this for my thesis.
Happened to me too and killed a long-running training run. Huge inconvenience.
Thanks @Antsypc, we’re aware of the issue and are actively working on a fix. It should be released in the next version of our client library due out in a week or so.