wandb: OSError Too many open files: /tmp/tmphv67gzd0wandb-media
I have been using YOLOv5's wandb integration and it is giving me this error:
File "/opt/conda/lib/python3.8/shutil.py", line 712, in rmtree
OSError: [Errno 24] Too many open files: '/tmp/tmphv67gzd0wandb-media'
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/weakref.py", line 642, in _exitfunc
File "/opt/conda/lib/python3.8/weakref.py", line 566, in __call__
File "/opt/conda/lib/python3.8/tempfile.py", line 817, in _cleanup
File "/opt/conda/lib/python3.8/tempfile.py", line 813, in _rmtree
File "/opt/conda/lib/python3.8/shutil.py", line 714, in rmtree
File "/opt/conda/lib/python3.8/shutil.py", line 712, in rmtree
OSError: [Errno 24] Too many open files: '/tmp/tmphv67gzd0wandb-media'
This happens while running the genetic algorithm for hyperparameter evolution. Any idea why wandb is doing this?
About this issue
- State: open
- Created 3 years ago
- Reactions: 4
- Comments: 39 (8 by maintainers)
For me, this issue arose when I was logging a large number of artifacts (in my case, even just a single wandb.Table, but at every time step for a few thousand steps) on a cluster with a low ulimit -n (4096 in my case). I fixed it by changing my code to not log artifacts, which means I lost much of the value of wandb in the first place, but at least my experiments could run to completion without stalling (a sketch of that kind of change follows below).

If logging many artifacts is the main trigger for this issue, then my guess is that the wandb team made a naive assumption about how users would use their tools: they built software that opens, and keeps open, at least one file for every artifact logged, and that design did not generalize to real-world use cases. Especially with the new NLP tools, I hope this issue gets more attention, as it is really useful to be able to log a batch of tabular data at each step.
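To make the "don't log artifacts" workaround concrete, here is a minimal sketch, assuming a plain wandb.init / wandb.log training loop (the project and metric names are placeholders, not from the original code):

import wandb

run = wandb.init(project="no-artifact-logging")  # placeholder project name

for step in range(1000):
    # Plain scalar logging does not create per-step artifact/media files,
    # so no extra file descriptors are held open as training progresses.
    wandb.log({"loss": 1.0 / (step + 1)}, step=step)
    # Avoided: wandb.log({"preds": wandb.Table(...)}, step=step) at every step,
    # which is the pattern that ran into the open-file limit.

run.finish()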
@MBakirWB @shawnlewis Here's some simple code to repro this (sketched below). Run ulimit -n 64 beforehand to artificially set the file limit low for testing, though from experience I expect this to also happen with 1024 or even 4096.

Around step 186 I start to get the Too many open files: error, then in the early 200s I start to see weirder tracebacks and errors. Finally, it hangs after Logged "Step 999" rather than finishing the run and exiting. When I check the web UI, the logs and the view of the table seem to have stopped after Step 21, not even reaching the first CLI-visible error at step 186.
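The exact repro script isn't shown here, but a minimal sketch of the pattern described, using only the public wandb.init / wandb.log / wandb.Table calls (project and column names are placeholders), looks roughly like this:

import wandb

run = wandb.init(project="fd-leak-repro")  # placeholder project name

for step in range(1000):
    # Logging a table at every step creates a media/artifact file each time,
    # which is what eventually exhausts the open-file limit.
    table = wandb.Table(columns=["step", "value"], data=[[step, step * 0.1]])
    wandb.log({"my_table": table}, step=step)
    print(f'Logged "Step {step}"')

run.finish()

Run ulimit -n 64 in the shell first to hit the error quickly, as described above.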
Run on Ubuntu via Windows Subsystem for Linux with the following uname -a (though first encountered on a Linux SLURM cluster at my university).

@maxzw, thank you for writing in. This bug is still being addressed. We will update the community here once a fix has been implemented. Regards
I am still experiencing this 😦 This has to be reopened.
@pvtien96 Do you think you will be able to provide a small repro example to help us further debug this?
I have also hit this issue - lost 12 long-running experiments, brutal!
I had this issue during a sweep. My hacky fix is to restart the sweep agent after every run and bump ulimit -n to 4096 from 1024 (an in-process version is sketched below). Hopefully this will work… this is a super annoying bug.

In your code, avoid logging artifacts if you can (e.g. don't do wandb.log({... wandb.Table(...)}) or other things that create artifacts).

I hope some wandb people are looking at this. It seems pretty bad and renders wandb close to unusable for researchers on shared compute clusters who need to log many artifacts (especially for NLP or CV).
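For what it's worth, the ulimit bump can also be done from inside the training script itself; this is just a sketch using Python's standard resource module (Linux/macOS only), not anything wandb-specific:

import resource

# Raise this process's soft limit on open file descriptors: the in-process
# equivalent of running `ulimit -n 4096` in the shell before launching.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 4096
if hard != resource.RLIM_INFINITY:
    target = min(target, hard)  # the soft limit may not exceed the hard limit
resource.setrlimit(resource.RLIMIT_NOFILE, (max(soft, target), hard))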
@jxmorris12 Unfortunately, that didn't help. Thanks anyway for posting; maybe it works for others.
@jonasjuerss This workaround worked for me (so far):
Is there any workaround? I am running jobs on SLURM, and at some point wandb just stops logging because of this. I stopped the run via wandb, hoping it would sync after canceling, but it's just stuck in "stopping" forever. I really rely on this for my thesis.
Happened to me too and killed a long-running training run. Huge inconvenience.
Thanks @Antsypc, we’re aware of the issue and are actively working on a fix. It should be released in the next version of our client library due out in a week or so.