GitPython: Some files are never closed
I’m working on a script using GitPython to loop over lots of Git repos, pull remote changes, etc…
At some point, I noticed that calling repo.is_dirty() opens some files which are never closed until the process exits, which in turn causes the tool to crash with “too many open files” after iterating over enough repos:
import os
import git
# Put here whatever Git repos you might have in the current folder
# Listing more than one makes the problem more visible
repos = ["ipset", "libhtp"]
raw_input("Check open files with `lsof -p %s | grep %s`" % (os.getpid(),
                                                            os.getcwd()))
for name in repos:
    repo = git.Repo(name)
    repo.is_dirty()
    del repo
raw_input("Check open files again")  # files are still open
I tried digging deeper down the GitPython stack, but couldn’t find the actual cause.
In case that’s helpful, below is the same as the previous snippet, but directly using the lowest-level gitdb objects I could find that open the files:
import os
import git
from gitdb.util import hex_to_bin
# Put here whatever Git repos you might have in the current folder
# Listing more than one makes the problem more visible
repos = ["ipset", "libhtp"]
raw_input("Check open files with `lsof -p %s | grep %s`" % (os.getpid(),
                                                            os.getcwd()))
for name in repos:
    repo = git.Repo(name)
    sha = hex_to_bin("71acab3ca115b9ec200e440188181f6878e26f08")
    for db in repo.odb._dbs:
        try:
            for item in db._entities:  # db._entities opens *.pack files
                item[2](sha)           # opens *.idx files
        except AttributeError:
            # Not all DBs have this attribute
            pass
    del repo
raw_input("Check open files again")  # files are still open
But that’s just where the files are opened (well, in fact they are opened even lower down, but I didn’t manage to go deeper), not where they are supposed to be closed.
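The lsof check can also be done from inside the script, which makes it easier to watch the count grow in a loop. The following is a minimal sketch, assuming Linux (it reads /proc/self/fd); the helper name open_files_under is made up for illustration. It only sees real file descriptors: a file that is memory-mapped and whose descriptor has already been closed shows up in lsof but not here.

```python
import os


def open_files_under(directory):
    """Linux only: paths of this process's open fds located under directory."""
    root = os.path.realpath(directory)
    found = []
    for fd in os.listdir("/proc/self/fd"):
        try:
            target = os.readlink(os.path.join("/proc/self/fd", fd))
        except OSError:
            continue  # the fd was closed between listdir() and readlink()
        if target == root or target.startswith(root + os.sep):
            found.append(target)
    return sorted(found)
```

Calling open_files_under(os.getcwd()) before and after the loop over the repos should show which *.pack and *.idx files are still held open.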
About this issue
- State: closed
- Created 12 years ago
- Comments: 28 (10 by maintainers)
Commits related to this issue
- trying to fix memory leak issue with instantiating repositories with GitCmdObjectDB(https://github.com/gitpython-developers/GitPython/issues/60#issuecomment-148637260) — committed to bavaria95/jens by bavaria95 8 years ago
- fixing file descriptor leaks using GitCmdObjectDB when initiating repo(suggestion of library creator; https://github.com/gitpython-developers/GitPython/issues/60#issuecomment-148637260 ; for the futu... — committed to bavaria95/jens by bavaria95 8 years ago
@virgiliu: sure. I just hope @Byron doesn’t mind that we’re using his bug tracker to discuss like this. 😃
So, in one of my projects I’m using the following solution. I haven’t pushed it yet (mostly because I’m not entirely satisfied with it) so I can’t just give you a link to the code. You can assume that the code below is MIT-licensed.
So first, I have the following decorator:
Then, you just need to decorate a method with it, and it gets run as a subprocess!
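The decorator itself is not reproduced here. As a rough sketch of the idea (not the author's actual code), assuming the POSIX-only "fork" start method of multiprocessing, it could look something like this. The names run_in_subprocess and leak_a_descriptor are hypothetical.

```python
import functools
import multiprocessing
import os

# Assumption: the POSIX-only "fork" start method, so the child inherits the
# function and its arguments directly and nothing has to be pickled.
_FORK = multiprocessing.get_context("fork")


def run_in_subprocess(func):
    """Hypothetical decorator: run every call to func in a short-lived child."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        proc = _FORK.Process(target=func, args=args, kwargs=kwargs)
        proc.start()
        proc.join()  # fds and memory leaked by func die with the child
        return proc.exitcode  # 0 on success; the return value of func is lost

    return wrapper


@run_in_subprocess
def leak_a_descriptor(path):
    # Stand-in for a leaky call such as git.Repo(path).is_dirty().
    os.open(path, os.O_RDONLY)  # deliberately never closed
```

Each call to leak_a_descriptor opens the file in a child process only, so the parent's descriptor table stays untouched no matter how many times it is called.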
In my case, I have a class Foobar which does stuff on various Git repositories:
Unfortunately, the do_stuff_on_repo method goes through the wrong code path and I call it repeatedly (even on different repositories), so it leaks memory and fds.
As a result, I’m just defining it in this way:
All I changed is that I decorated it, and now each call to the function will be run in its own process, so the leaking memory and fds get cleared when the process exits.
The only tricky part which is not handled above is what to do in case of an error: if do_some_stuff_on_repo raises an exception, for example, then the subprocess just exits, without affecting the main process where your application is running.
How to handle that properly is left as an exercise to the reader. (because that’s the part I’m not very happy with 😛)
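One possible way to handle the error case, sketched under assumptions rather than taken from the author's code: have the child send either its return value or a formatted traceback back over a pipe, and raise in the parent when the call failed. The names run_in_subprocess and ChildCallError are hypothetical, the "fork" start method is POSIX-only, and return values have to be picklable.

```python
import functools
import multiprocessing
import traceback

# Assumption: POSIX-only "fork" start method, so func itself is never pickled.
_FORK = multiprocessing.get_context("fork")


class ChildCallError(RuntimeError):
    """Hypothetical error raised in the parent when the child call failed."""


def _call_and_report(conn, func, args, kwargs):
    # Runs in the child: send (True, result) or (False, traceback text).
    try:
        conn.send((True, func(*args, **kwargs)))
    except BaseException:
        conn.send((False, traceback.format_exc()))
    finally:
        conn.close()


def run_in_subprocess(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        read_end, write_end = _FORK.Pipe(duplex=False)
        proc = _FORK.Process(target=_call_and_report,
                             args=(write_end, func, args, kwargs))
        proc.start()
        write_end.close()  # the parent only reads
        try:
            ok, payload = read_end.recv()
        except EOFError:
            ok, payload = False, "child exited without reporting a result"
        finally:
            read_end.close()
            proc.join()
        if not ok:
            raise ChildCallError(payload)
        return payload

    return wrapper
```

Sending the traceback as text sidesteps exceptions that cannot be pickled, at the cost of losing the original exception type in the parent.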
I think it will work if you use the latest HEAD of this repository, which would be v1.0.2 if I hadn’t lost access to my PyPI account.
In v1.0.1, you should be able to instantiate the Repo type with git.Repo(path, odbt=GitCmdObjectDB) to alleviate the issue.
If this works for you, I’d be glad if you could let me know here as well.