wandb: wandb is leaking file pointers?
In my code, I am creating and closing wandb loggers to different projects (I have multiple training pipelines running at the same time). I noticed that wandb seems to be leaking open files. This is causing problems with my code because at a certain point there are too many open files so the job is killed.
Here is a script which replicates this problem:
import psutil
import wandb


def print_file_info():
    # Report the regular files this process currently has open.
    proc = psutil.Process()
    open_files = proc.open_files()
    print('Num open files: %s' % len(open_files))
    for open_file in open_files:
        print('\t%s' % open_file.path)


print('Before any WANDB stuff')
print_file_info()

# First run: create it, remember its ID, and finish it.
run = wandb.init(project='test', name='test', reinit=True, resume='allow')
run_id = run.id
run.finish()
print('After creating run and getting run ID')
print_file_info()

# Second run: resume the same run by ID, then finish it again.
run = wandb.init(id=run_id, project='test', name='test', reinit=True, resume='allow')
run.finish()
print('After accessing run again')
print_file_info()

# Control case: an ordinary file shows up while open and disappears once closed.
test_file = open('test_file.txt', 'w')
print('After creating a normal file pointer')
print_file_info()
test_file.close()
print('After closing that file')
print_file_info()
If I run with WANDB_SILENT=true python wandb_file_leak_test.py, the output will be something like:
Before any WANDB stuff
Num open files: 0
After creating run and getting run ID
Num open files: 1
/home/alsuhr/Documents/testing/wandb/run-20201031_132102-j0ebkg08/logs/debug.log
After accessing run again
Num open files: 2
/home/alsuhr/Documents/testing/wandb/run-20201031_132102-j0ebkg08/logs/debug.log
/home/alsuhr/Documents/testing/wandb/run-20201031_132106-j0ebkg08/logs/debug.log
After creating a normal file pointer
Num open files: 3
/home/alsuhr/Documents/testing/wandb/run-20201031_132102-j0ebkg08/logs/debug.log
/home/alsuhr/Documents/testing/wandb/run-20201031_132106-j0ebkg08/logs/debug.log
/home/alsuhr/Documents/testing/test_file.txt
After closing that file
Num open files: 2
/home/alsuhr/Documents/testing/wandb/run-20201031_132102-j0ebkg08/logs/debug.log
/home/alsuhr/Documents/testing/wandb/run-20201031_132106-j0ebkg08/logs/debug.log
Notice how the test file shows up while it is open and disappears from the list once it is closed, whereas both wandb debug.log files are still open after run.finish(). What is going on? How can I make sure that wandb actually closes these file pointers?
Thanks!
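For the question above about making sure the debug.log handles get closed, here is a minimal sketch of one thing that might be worth trying. It assumes (this is not confirmed anywhere in this thread) that the open files belong to standard logging.FileHandler objects left attached to a logger after run.finish(). The helper name close_file_handlers and the logger name used in the demo are hypothetical; the demo runs on a stand-in logger and does not touch wandb at all.

```python
import logging
import os
import tempfile


def close_file_handlers(logger_name, filename_suffix="debug.log"):
    """Detach and close FileHandlers on one logger whose path ends with filename_suffix.

    Returns the paths of the handlers that were closed.
    """
    logger = logging.getLogger(logger_name)
    closed = []
    for handler in list(logger.handlers):  # iterate over a copy; the list is mutated below
        if (isinstance(handler, logging.FileHandler)
                and handler.baseFilename.endswith(filename_suffix)):
            logger.removeHandler(handler)
            handler.close()  # releases the underlying file descriptor
            closed.append(handler.baseFilename)
    return closed


# Demonstration on a stand-in logger (hypothetical name, no wandb involved).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_logger = logging.getLogger("hypothetical_demo_logger")
    demo_logger.addHandler(logging.FileHandler(os.path.join(tmp_dir, "debug.log")))
    print(close_file_handlers("hypothetical_demo_logger"))
```

If the assumption holds, calling something like close_file_handlers("wandb") after each run.finish() might release the handles; the logger name "wandb" is a guess and would need to be checked against the installed version.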
Forgot to mention:
- wandb version 0.10.8
- python version 3.7.6
uname -a: Linux bigbox 4.15.0-112-generic #113~16.04.1-Ubuntu SMP Fri Jul 10 04:37:08 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
More details:
- I can replicate this even if I remove reinit=True and resume='allow'.
- The documentation seems to suggest this file is only written to if WANDB_SILENT=true (https://docs.wandb.com/library/environment-variables#optional-environment-variables). However, I can replicate this even if I remove that environment variable (I just set it above so the output is easier to read).
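Since the job is killed once there are too many open files, it may help to watch how close the process is to its descriptor limit. The sketch below is one way to do that on Linux or macOS using only the standard library; the function names open_fd_count and fd_usage are made up for illustration. Unlike psutil's open_files(), which reports regular files only, this counts every descriptor, including sockets and pipes.

```python
import os
import resource  # Unix only


def open_fd_count():
    """Count this process's open descriptors (files, sockets, pipes, ...)."""
    for fd_dir in ("/proc/self/fd", "/dev/fd"):  # Linux first, then macOS/BSD
        if os.path.isdir(fd_dir):
            # Listing the directory briefly opens one descriptor itself,
            # so the result may be one higher than the true count.
            return len(os.listdir(fd_dir))
    raise OSError("no per-process descriptor directory found")


def fd_usage():
    """Return (open descriptors, soft limit) for the current process."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fd_count(), soft_limit


used, limit = fd_usage()
print("open descriptors: %d (soft limit: %d)" % (used, limit))
```

Printing fd_usage() after each run.finish() would show whether the count keeps growing from run to run, which is the pattern the script above suggests.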
About this issue
- State: closed
- Created 4 years ago
- Reactions: 3
- Comments: 23 (5 by maintainers)
Issue-Label Bot is automatically applying the label bug to this issue, with a confidence of 0.62. Please mark this comment with 👍 or 👎 to give our bot feedback!
EDIT: the issue didn’t get reopened, so I opened a new one: #3974
Comment to reopen: I am running into the same issues (both too many open files and leaked semaphores) on wandb=0.12.21, Python 3.9.13, macOS Big Sur 11.6.
Code to reproduce:
The “leaked semaphores” appears during the first run, but this may be happening because I’ve been testing in the same shell. The full error:
I got the “too many open files” error after the 18th run. The error message is quite long, but appears to just repeat the following segment:
In my case, I stopped using tqdm (it was only used in 3 loops), and after that the problem with file handles disappeared.
Thanks for the update. We just released 0.10.29, which will give us more debugging information about the leaked semaphores (a leaked semaphore means Python thinks subprocesses should still exist, but it can’t find them).
Hey @GillesJ, we’re looking into this. In the meantime, you can avoid this by calling wandb agent sweep_id from the command line and configuring the sweep to execute a program that calls the train function.
Thanks for the reproduction cases, we have raised the priority and will get this fixed soon.
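The command-line agent workaround mentioned above could look roughly like the sketch below: a standalone program that the sweep executes, so that each run gets a fresh process. The hyperparameter names (learning_rate, epochs) and the body of train are placeholders, not taken from anyone's code in this thread. It relies on the agent passing sweep parameters as --name=value flags, which is its default behaviour as far as I know. wandb is imported inside main so the helper functions can be exercised without it.

```python
import argparse


def parse_args(argv=None):
    # Sweep parameters arrive as --name=value flags (hypothetical names here).
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--epochs", type=int, default=1)
    return parser.parse_args(argv)


def train(learning_rate, epochs, log_fn=print):
    # Placeholder training loop: shrinks a fake loss and logs it each epoch.
    loss = 1.0
    for epoch in range(epochs):
        loss *= (1.0 - learning_rate)
        log_fn({"epoch": epoch, "loss": loss})
    return loss


def main(argv=None):
    import wandb  # imported lazily; only needed when actually logging a run

    args = parse_args(argv)
    run = wandb.init(config=vars(args))
    try:
        train(args.learning_rate, args.epochs, log_fn=wandb.log)
    finally:
        run.finish()
```

Saved as the sweep's program with a call to main() under the usual if __name__ == "__main__": guard, each run would start and exit as its own process, so any descriptors it leaves open should be reclaimed by the operating system when it exits.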
Hi, sorry, I actually haven’t worked on this as I figured out a workaround for the second issue (basically I made it so my code only opens wandb a few times at most per process) and got caught up in my research.
I had started creating a PR but couldn’t set up the testing environment for wandb iirc (or something else was hard to get working on my machine) and other things came up so I forgot. I can try creating one again (it’s on my todo list) but no promises when I can get it done due to deadlines / etc. Thanks!