wandb: wandb is leaking file pointers?

In my code, I create and close wandb loggers for different projects (I have multiple training pipelines running at the same time). I noticed that wandb seems to leak open files, which causes problems because at some point the process hits the open-file limit and the job is killed.

Here is a script which replicates this problem:

import psutil
import wandb

def print_file_info():
    # List the files currently held open by this process.
    proc = psutil.Process()
    print('Num open files: %s' % len(proc.open_files()))
    for open_file in proc.open_files():
        print('\t%s' % open_file.path)

print('Before any WANDB stuff')
print_file_info()
run = wandb.init(project='test', name='test', reinit=True, resume='allow')
run_id = run.id
run.finish()

print('After creating run and getting run ID')
print_file_info()
run = wandb.init(id=run_id, project='test', name='test', reinit=True, resume='allow')
run.finish()

print('After accessing run again')
print_file_info()

test_file = open('test_file.txt', 'w')
print('After creating a normal file pointer')
print_file_info()

test_file.close()
print('After closing that file')
print_file_info()

If I run with WANDB_SILENT=true python wandb_file_leak_test.py, the output will be something like:

Before any WANDB stuff
Num open files: 0


After creating run and getting run ID
Num open files: 1
	/home/alsuhr/Documents/testing/wandb/run-20201031_132102-j0ebkg08/logs/debug.log


After accessing run again
Num open files: 2
	/home/alsuhr/Documents/testing/wandb/run-20201031_132102-j0ebkg08/logs/debug.log
	/home/alsuhr/Documents/testing/wandb/run-20201031_132106-j0ebkg08/logs/debug.log
After creating a normal file pointer
Num open files: 3
	/home/alsuhr/Documents/testing/wandb/run-20201031_132102-j0ebkg08/logs/debug.log
	/home/alsuhr/Documents/testing/wandb/run-20201031_132106-j0ebkg08/logs/debug.log
	/home/alsuhr/Documents/testing/test_file.txt
After closing that file
Num open files: 2
	/home/alsuhr/Documents/testing/wandb/run-20201031_132102-j0ebkg08/logs/debug.log
	/home/alsuhr/Documents/testing/wandb/run-20201031_132106-j0ebkg08/logs/debug.log

Notice how the normal test file shows up and then disappears once it is closed, while the two wandb debug.log handles remain open even after run.finish(). What is going on? How can I make sure that these file pointers are actually closed by wandb?
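
One workaround I'm considering (a rough, untested sketch, not my actual pipeline code): since open descriptors are released by the OS when their owning process exits, each wandb session could be isolated in a short-lived child process so that whatever it leaks dies with it:

import multiprocessing as mp
import wandb

def run_one_experiment(run_name):
    # Everything wandb opens here belongs to the child process.
    run = wandb.init(project='test', name=run_name, reinit=True, resume='allow')
    # ... training / logging would go here ...
    run.finish()

if __name__ == '__main__':
    for name in ['exp-a', 'exp-b']:
        p = mp.Process(target=run_one_experiment, args=(name,))
        p.start()
        p.join()  # any leaked descriptors are released when the child exits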

Thanks!

Forgot to mention:

  • wandb version 0.10.8
  • python version 3.7.6
  • uname -a: Linux bigbox 4.15.0-112-generic #113~16.04.1-Ubuntu SMP Fri Jul 10 04:37:08 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Most upvoted comments


EDIT: the issue didn’t get reopened, so I opened a new one: #3974

Comment to reopen: running into the same issues (both too many open files and leaked semaphores) on wandb 0.12.21, Python 3.9.13, macOS Big Sur 11.6.

Code to reproduce:

import wandb
for i in range(100):
    with wandb.init(entity='exr0nprojects', project='snap', group='useless'):
        print(f"run number {i}")

The “leaked semaphores” warning appears during the first run, but that may be because I’ve been testing in the same shell. The full warning:

/usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 54 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

I got the “too many open files” error after the 18th run. The error message is quite long, but appears to just repeat the following segment:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_init.py", line 1043, in init
    run = wi.init()
  File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_init.py", line 556, in init
    backend.ensure_launched()
  File "/usr/local/lib/python3.9/site-packages/wandb/sdk/backend/backend.py", line 220, in ensure_launched
    self.wandb_process.start()
  File "/usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 58, in _launch
    self.pid = util.spawnv_passfds(spawn.get_executable(),
  File "/usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/util.py", line 450, in spawnv_passfds
    errpipe_read, errpipe_write = os.pipe()
OSError: [Errno 24] Too many open files

The above exception was the direct cause of the following exception: [etc]
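
As a stopgap while this is unresolved (it does not fix the leak, just delays the crash), the soft file-descriptor limit can be raised from Python before the loop, for example:

import resource

# Stopgap only: raise the soft open-file limit to the hard limit so the loop
# survives more wandb.init() calls. The descriptors still accumulate.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
new_soft = hard if hard != resource.RLIM_INFINITY else 4096
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))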

I don’t know for sure whether it was the update to 0.10.28 or updating tqdm to the latest version (I was using it, and some comments suggest it could be a factor too), but the crashes with OSError for the file descriptor limit have stopped. I do still get shell warnings about leaked semaphore objects if I interrupt the training process, though.

In my case, I stopped using tqdm (it was only used in three loops), and after that the problem with file handles disappeared.

Thanks for the update. We just released 0.10.29, which will give us more debugging information about the leaked semaphores (the warning means Python thinks subprocesses should still exist, but it can’t find them).

Hey @GillesJ, we’re looking into this. In the meantime, you can avoid this by calling wandb agent sweep_id from the command line and configuring the sweep to execute a program that calls the train function.
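
For a concrete picture of that workaround (the file name and config keys below are illustrative, not taken from this issue): the sweep config points its program field at a small script such as train.py, and wandb agent <sweep_id> is run from the shell, so each trial gets a fresh process and leaked descriptors don’t accumulate across trials:

# train.py (illustrative entry point referenced by the sweep's program field)
import wandb

def train():
    # The agent launches this script once per trial and injects that trial's
    # hyperparameters into wandb.config.
    with wandb.init() as run:
        cfg = dict(run.config)   # e.g. contains 'lr' if defined as a sweep parameter
        # ... training loop using cfg would go here ...
        run.log({'loss': 0.0})   # placeholder metric

if __name__ == '__main__':
    train()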

Thanks for the reproduction cases, we have raised the priority and will get this fixed soon.

Hi, sorry, I actually haven’t worked on this since I found a workaround for the second issue (basically I changed my code so it initializes wandb at most a few times per process) and got caught up in my research.

I had started on a PR, but IIRC I couldn’t get the wandb testing environment set up (or something else was hard to get working on my machine), other things came up, and I forgot about it. I can try again (it’s on my todo list), but no promises on when I can get it done due to deadlines, etc. Thanks!