wandb: OSError Too many open files: /tmp/tmphv67gzd0wandb-media

I have been using yolov5’s wandb integration and it is giving me this error:

File "/opt/conda/lib/python3.8/shutil.py", line 712, in rmtree
OSError: [Errno 24] Too many open files: '/tmp/tmphv67gzd0wandb-media'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/weakref.py", line 642, in _exitfunc
  File "/opt/conda/lib/python3.8/weakref.py", line 566, in __call__
  File "/opt/conda/lib/python3.8/tempfile.py", line 817, in _cleanup
  File "/opt/conda/lib/python3.8/tempfile.py", line 813, in _rmtree
  File "/opt/conda/lib/python3.8/shutil.py", line 714, in rmtree
  File "/opt/conda/lib/python3.8/shutil.py", line 712, in rmtree
OSError: [Errno 24] Too many open files: '/tmp/tmphv67gzd0wandb-media'

when running the genetic algorithm for hyperparameter evolution. Any idea why wandb is doing this?

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 4
  • Comments: 39 (8 by maintainers)

Most upvoted comments

For me, I figured out that this issue arose when I was logging a large number of artifacts (in my case, even just a single wandb.Table, but at every time step for a few thousand steps) on a cluster with a low ulimit -n (4096 in my case). I fixed it by changing my code to not log artifacts, which meant losing much of the value of wandb in the first place, but at least I was able to run my experiments and have them finish without stalling.

If logging many artifacts is the main cause of this issue, then my guess is that the wandb team made a naive assumption about how users would use their tools: they built software that opens, and keeps open, at least one file for every artifact logged, and that design did not generalize to real-world use cases. Especially with the new NLP tools, I hope this issue gets more attention, as it’s really useful to be able to log a bunch of tabular data at each step.
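
If you want to confirm that the open-file limit is really what you’re hitting, a quick diagnostic on Linux is to compare the process’s open-descriptor count against its limit. This is just a standard-library sketch, not anything wandb provides:

import os
import resource

# Soft/hard limits on open file descriptors for this process (what `ulimit -n` reports).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Number of file descriptors currently open in this process (Linux-specific).
open_fds = len(os.listdir("/proc/self/fd"))

print(f"open fds: {open_fds}, soft limit: {soft}, hard limit: {hard}")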

@MBakirWB @shawnlewis Here’s some simple code to repro this; it artificially lowers the open-file limit to 64 for testing, though from experience I expect this to also happen with limits of 1024 or even 4096.

"""Reproduce https://github.com/wandb/wandb/issues/2825"""

import os
import wandb

# Limit the file limit to 64 with ulimit -n 64
os.system("ulimit -n 64")

wandb.init()

for i in range(1000):
    # Log a very simple artifact at each iteration
    data = f"Step {i}"
    wandb.log({"my_table": wandb.Table(columns=["text"], data=[[data]])})
    print(f'Logged "{data}"')

Around step 186 I start to get the Too many open files error, then in the early 200s I start to see weirder tracebacks and errors. Finally, it hangs after Logged "Step 999" rather than finishing the run and exiting.

When I check the web UI, the logs and the view of the table seem to have stopped after Step 21, not even reaching the first error that is visible in the CLI around step 186.

Run on Ubuntu via Windows Subsystem for Linux with the following uname -a output (though I first encountered this on my university’s Linux SLURM cluster).

Linux DESKTOP-7VO7NFL 5.15.90.1-microsoft-standard-WSL2 #1 SMP Fri Jan 27 02:56:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

@maxzw, thank you for writing in. This bug is still being addressed. We will update the community here once a fix has been implemented. Regards

I am still experiencing this 😦 This issue has to be reopened.

@pvtien96 do you think you will be able to provide a small repro example to help us debug this further?

I have also hit this issue - lost 12 long-running experiments, brutal!

I had this issue during a sweep. My hacky fix is to restart the sweep agent after every run and bump ulimit -n to 4096 from 1024. Hopefully this will work… this is a super annoying bug.
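
For reference, here is a minimal sketch of the restart-after-every-run part of that workaround, assuming a sweep at entity/project/sweep_id (a placeholder) and relying on wandb agent’s --count flag so each run gets a fresh agent process (and a fresh file-descriptor table):

import subprocess

SWEEP_PATH = "entity/project/sweep_id"  # placeholder: replace with your sweep

# Launch a fresh agent process per run; --count 1 makes each agent exit after
# a single run, so leaked descriptors cannot accumulate across runs.
for _ in range(20):
    subprocess.run(["wandb", "agent", "--count", "1", SWEEP_PATH], check=True)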

I solved it by not logging the artifacts. This is sad; there should be some way to do this more efficiently. I wasted a lot of time while my runs were crashing or the logging stopped entirely.

How do you disable logging the artifacts?

In your code, do not log artifacts (e.g. don’t call wandb.log({... wandb.Table(...)}) or anything else that creates artifacts).
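
One simple way to do that is to gate table/media logging behind a local flag so scalar metrics still get logged. LOG_TABLES below is just a hypothetical switch in your own code, not a wandb setting:

import wandb

LOG_TABLES = False  # hypothetical local switch, not a wandb option

wandb.init()

for step in range(1000):
    metrics = {"loss": 1.0 / (step + 1)}
    if LOG_TABLES:
        # wandb.Table (and other media objects) are what end up as artifact files.
        metrics["preds"] = wandb.Table(columns=["text"], data=[[f"step {step}"]])
    wandb.log(metrics)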

I hope some wandb people are looking at this. This seems pretty bad and renders wandb quite unusable for researchers using shared compute clusters who need to log many artifacts (especially for NLP or CV).

@jxmorris12 Unfortunately, that didn’t help. Thanks anyway for posting, maybe it works for others.

@jonasjuerss This workaround worked for me (so far):

import resource

# Raise the limit on core-dump size to unlimited for this process.
resource.setrlimit(
    resource.RLIMIT_CORE, (resource.RLIM_INFINITY, resource.RLIM_INFINITY)
)
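
For what it’s worth, the error in this thread is errno 24 (EMFILE), which is governed by the open-file limit rather than the core-dump limit, so a closely related thing to try is raising the soft RLIMIT_NOFILE up to the hard limit. This is only a sketch of that idea, not an official fix:

import resource

# An unprivileged process may raise its soft open-file limit up to, but not
# beyond, the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))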

Is there any workaround? I am running jobs on SLURM, and at some point wandb just stops logging because of this. I stopped the run via wandb, hoping it would sync after canceling, but it’s just stuck in “stopping” forever. I really rely on this for my thesis.

Happened to me too. Killed a long-running training run, huge inconvenience

Thanks @Antsypc, we’re aware of the issue and are actively working on a fix. It should be released in the next version of our client library due out in a week or so.