GitPython: Some files are never closed

I’m working on a script that uses GitPython to loop over lots of Git repos, pull remote changes, etc.

At some point, I noticed that calling repo.is_dirty() was opening some files which never got closed until the process exits, which in turn causes the tool to crash with “too many open files” after iterating over enough repos:

import os

import git

# Put here whatever Git repos you might have in the current folder
# Listing more than one makes the problem more visible
repos = ["ipset", "libhtp"]

raw_input("Check open files with `lsof -p %s | grep %s`" % (os.getpid(),
                                                            os.getcwd()))

for name in repos:
    repo = git.Repo(name)
    repo.is_dirty()

    del repo

raw_input("Check open files again")                 # files are still open
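Incidentally, the final crash doesn’t need Git at all to reproduce: any file-descriptor leak hits the same wall once the process limit is reached. Here is an illustrative stdlib snippet of my own (Linux/Unix only, not GitPython code) that lowers the limit with the resource module and runs into EMFILE, the errno behind “too many open files”:

```python
import resource


def exhaust_fds(limit=32, tries=64):
    """Lower the soft fd limit, open files until open() fails,
    then close everything and restore the original limit.

    Returns the OSError that stopped the loop (errno EMFILE,
    i.e. "too many open files"), or None if the limit was never hit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (limit, hard))
    files, error = [], None
    try:
        for _ in range(tries):
            files.append(open("/dev/null"))
    except OSError as exc:
        error = exc
    finally:
        for f in files:
            f.close()
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
    return error
```

Every leaked .pack/.idx descriptor brings the process one step closer to that limit, which is why iterating over enough repos eventually crashes.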

I tried digging deeper down the GitPython stack, but couldn’t find the actual cause.

In case that’s helpful, below is the first snippet again, but directly using the lowest-level gitdb objects I could find that open the files:

import os

import git
from gitdb.util import hex_to_bin

# Put here whatever Git repos you might have in the current folder
# Listing more than one makes the problem more visible
repos = ["ipset", "libhtp"]

raw_input("Check open files with `lsof -p %s | grep %s`" % (os.getpid(),
                                                            os.getcwd()))

for name in repos:
    repo = git.Repo(name)
    sha = hex_to_bin("71acab3ca115b9ec200e440188181f6878e26f08")

    for db in repo.odb._dbs:
        try:
            for item in db._entities:               # db._entities opens *.pack files
                item[2](sha)                        # opens *.idx files

        except AttributeError:
            # Not all DBs have this attribute
            pass

    del repo

raw_input("Check open files again")                 # files are still open

But that’s just where the files are opened (in fact they are opened even lower down, but I didn’t manage to go deeper), not where they are supposed to be closed.
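As an aside, lsof isn’t strictly needed to watch the leak: on Linux the process can count its own descriptors through /proc. This small helper is my own addition, not part of GitPython:

```python
import os


def open_fd_count():
    """Return how many file descriptors this process currently holds
    (Linux-only: counts the entries of /proc/self/fd)."""
    return len(os.listdir("/proc/self/fd"))


# Sample the count before, while, and after holding a file open.
before = open_fd_count()
f = open("/dev/null")
while_open = open_fd_count()
f.close()
```

Sampling open_fd_count() before and after each repo.is_dirty() call should show the number only ever going up.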

About this issue

  • State: closed
  • Created 12 years ago
  • Comments: 28 (10 by maintainers)

Most upvoted comments

@virgiliu: sure. I just hope @Byron doesn’t mind that we’re using his bug tracker to discuss like this. 😃

So, in one of my projects I’m using the following solution. I haven’t pushed it yet (mostly because I’m not entirely satisfied with it) so I can’t just give you a link to the code. You can assume that the code below is MIT-licensed.

So first, I have the following decorator:

class subprocessed(object):
    """Decorator to run a function in a subprocess

    We can't use GitPython directly because it leaks memory and file
    descriptors:
        https://github.com/gitpython-developers/GitPython/issues/60

    Until it is fixed, we have to use multiple processes :(
    """
    def __init__(self, func):
        self.func = func

    def __get__(self, instance, *args, **kwargs):
        self.instance = instance
        return self

    def __call__(self, *args):
        from multiprocessing import Process, Queue
        q = Queue()

        def wrapper(queue, func, *args):
            queue.put(func(*args))

        p = Process(target=wrapper,
                    args=(q, self.func, self.instance) + args)
        p.start()
        p.join()

        return q.get()

Then, you just need to decorate a method with it, and it gets run as a subprocess!

In my case, I have a class Foobar which does stuff on various Git repositories:

import os
import shutil

import git


class Foobar(object):
    def __init__(self):
        self.git_rooturl = "git://git.example.com"
        self.workdir = "/path/to/where/the/repo/will/be/cloned"

    def do_stuff_on_repo(self, reponame):
        curdir = os.getcwd()

        # Clone the module
        repo = git.Repo.clone_from("%s/%s" % (self.git_rooturl, reponame),
                                   os.path.join(self.workdir, reponame))
        os.chdir(repo.working_tree_dir)

        # Do your stuff here
        result = ...

        os.chdir(curdir)
        shutil.rmtree(repo.working_tree_dir)
        del repo

        return result

Unfortunately the do_stuff_on_repo method goes through the leaky code path, and I call it repeatedly (even on different repositories), so it leaks memory and fds.

As a result, I’m just defining it in this way:

    @subprocessed
    def do_stuff_on_repo(self, reponame):
        ...

All I changed is that I decorated it, and now each call to the function will be run in its own process, so the leaking memory and fds get cleared when the process exits.
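To make the isolation concrete, here is a standalone toy version of the same pattern. It is my own test rig, not code from the project; I pin the start method to “fork” so the nested wrapper works and the snippet stays self-contained on Linux. The decorated method leaks a descriptor on purpose, but the leak dies with the child process:

```python
from multiprocessing import get_context

ctx = get_context("fork")            # the nested wrapper below needs "fork"


class subprocessed(object):
    """Same decorator as above, repeated so this snippet stands alone."""
    def __init__(self, func):
        self.func = func

    def __get__(self, instance, *args, **kwargs):
        self.instance = instance
        return self

    def __call__(self, *args):
        q = ctx.Queue()

        def wrapper(queue, func, *args):
            queue.put(func(*args))

        p = ctx.Process(target=wrapper,
                        args=(q, self.func, self.instance) + args)
        p.start()
        p.join()
        return q.get()


class Leaky(object):
    @subprocessed
    def open_and_forget(self):
        open("/dev/null")            # leaked on purpose; dies with the child
        return "done"


result = Leaky().open_and_forget()
```

The parent’s descriptor table is untouched, because the open() happened in the child.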

The only tricky part not handled above is error handling: if do_stuff_on_repo raises an exception, the subprocess just dies with a traceback, the exception never reaches the main process where your application is running, and q.get() is left waiting for a result that will never arrive.

How to handle that properly is left as an exercise to the reader. (because that’s the part I’m not very happy with 😛)
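For what it’s worth, the direction I would explore for that exercise (my own sketch, not battle-tested) is to send either the result or the raised exception back over the queue, and re-raise it in the parent:

```python
from multiprocessing import get_context


def run_in_subprocess(func, *args):
    """Run func(*args) in a child process and return its result.

    If func raises, the exception travels back over the queue and is
    re-raised here, so callers see it as if the call were local.
    """
    ctx = get_context("fork")        # nested wrapper target; Unix-only

    def wrapper(queue, func, *args):
        try:
            queue.put((True, func(*args)))
        except Exception as exc:
            queue.put((False, exc))

    q = ctx.Queue()
    p = ctx.Process(target=wrapper, args=(q, func) + args)
    p.start()
    ok, value = q.get()              # read before join() so a large result
    p.join()                         # can't deadlock the pipe
    if ok:
        return value
    raise value
```

The same idea can be folded into subprocessed.__call__; the one constraint is that the exception object has to be picklable to cross the queue.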

I think it will work if you use the latest HEAD of this repository, which would be v1.0.2 if I hadn’t lost access to my pypi account.

In v1.0.1, you should be able to instantiate the Repo type with git.Repo(path, odbt=GitCmdObjectDB) to alleviate the issue.

If this works for you, I’d be glad if you could let me know here as well.