gitbase: In-memory caching leads to crash
Issue
In the context of doing topic modeling experiments, @m09 and I tried to use Gitbase to parse all blobs in tagged references of a given repository, in order to extract all identifiers, comments and literals. However, we have not been able to use Gitbase successfully to do so, and have had to switch to doing the parsing client-side.
The reason for that is that, when querying Gitbase, we see the following behavior:
- An increase in memory usage.
- No decrease after time goes by.
- When all available memory is consumed, an increase in block I/O and a quasi-stagnation of the memory consumed by Gitbase at 99.999 … %, indicating heavy use of swap memory.
- Server crash if the query goes on for too long past that point.
We still see the same behavior when retrieving only the blob contents from Gitbase; however, the memory consumed is not an issue in that case, as it is much lower than when parsing UASTs. We inferred that some caching was going on, and after talking about the issue on the dev-processing channel, we tried to disable the caching - however it changed nothing. Javi told us that the caching we had disabled was the go-git cache, so it is probably something else.
What we don’t understand is why we cannot get rid of the behavior, i.e. why, once a blob has been parsed and returned client-side, it seemingly remains in memory.
Steps to reproduce
Launch gitbase and babelfish containers:
docker run -d --rm --name bblfshd --privileged -p 9432:9432 -m 4g bblfsh/bblfshd:v2.14.0-drivers
docker run -d --rm --name gitbase -p 3306:3306 --link bblfshd:bblfshd -e BBLFSH_ENDPOINT=bblfshd:9432 -m 2g -v /path/to/repos:/opt/repos srcd/gitbase:latest
With /path/to/repos pointing to a repository, for instance pytorch. Then open two more terminals to monitor what’s happening with docker stats, and run a query like the following one (here for pytorch), using for example the MySQL client:
SELECT
    cf.file_path,
    cf.blob_hash,
    LANGUAGE(cf.file_path) AS lang,
    uast_extract(uast(f.blob_content, LANGUAGE(cf.file_path), '//uast:String'), "Value")
FROM repositories r
NATURAL JOIN refs rf
NATURAL JOIN commit_files cf
NATURAL JOIN files f
WHERE r.repository_id = 'pytorch'
    AND is_tag(rf.ref_name)
    AND lang = 'Python'
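For the monitoring mentioned above, the plain docker stats command on the two container names used in the commands above is enough, run from the other terminals:
docker stats gitbase bblfshd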
You should see the memory usage of the gitbase container increase sharply until it hits 2 GB, then a heavy increase in block I/O, and finally the container crashing.
About this issue
- State: closed
- Created 5 years ago
- Reactions: 2
- Comments: 20 (13 by maintainers)
I just tested; for the query above it seems to be working. I saw on docker stats that the memory usage stagnates around 1.13 GB after hitting the limit 👍
For the query above, which requires some caching before returning the result: the memory stagnated at about 1.26 GB, but the query still finished. It did not increase even after subsequent queries. So yeah, looks good guys 😃
I just tested this with the gitbase image built from master, using the same setup described by @r0mainK but adding -e MAX_MEMORY=1024 to the gitbase container to limit the cache memory to 1 GiB, and it seems to work. The memory usage of the gitbase container did not exceed ~1.3 GiB, the block I/O did not go crazy and the query actually finished. It is not yet released, but you can take an early look if you need it, @r0mainK.
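For anyone wanting to reproduce this check, the gitbase command from the steps to reproduce with that variable added would look roughly like this - a sketch only, since the fix is not released yet the image tag below would have to be swapped for one built from master, and the value is in MiB per the comment above:
# same command as in the steps to reproduce, plus MAX_MEMORY; replace srcd/gitbase:latest with an image built from master
docker run -d --rm --name gitbase -p 3306:3306 --link bblfshd:bblfshd -e BBLFSH_ENDPOINT=bblfshd:9432 -e MAX_MEMORY=1024 -m 2g -v /path/to/repos:/opt/repos srcd/gitbase:latest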
It’s very hard to accurately ensure it does not go over the hard limit, so it’s more of a soft limit. But it should not be much bigger than the limit that has been set.
UPDATE: beta3 does not fix the error; we’re cutting an rc1 version with the complete fix.
Currently there’s PR #957 adding it to gitbase. When it’s merged and released, let’s give it a try to see if this is solved.
This will probably be fixed by https://github.com/src-d/gitbase/issues/929, so let’s mark it as blocked until there is a release containing that.
@r0mainK we have several LRU caches whose limits are based on the number of elements, not on the total size of those elements. We should have a look into that and find a more user-friendly way to homogenize how cache limits are set.
@ajnavarro unfortunately, still on the same example (Python files in Pytorch’s tagged references):
As you can see, the issue does not seem to be mitigated by using these additional clauses, as they don’t apply. Furthermore, the intent is to process all files from tagged references, not only the HEAD - hence the is_tag clause. Additionally, even when we try to split large queries into smaller ones in order to parse fewer blobs per query, for example parsing language by language or ref by ref, it ends up changing nothing, since the memory is not released.
EDIT: I had not seen your comment on GITBASE_UAST_CACHE_SIZE, I thought it would be restricted by setting the cache option when launching gitbase, gonna check if it works 😃
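Concretely, I’ll just relaunch the gitbase container from the steps above with that variable set, something like this (the value below is purely illustrative, and I’m assuming it counts cached UAST entries rather than bytes - to be confirmed):
# GITBASE_UAST_CACHE_SIZE=128 is an illustrative value, not a recommendation; semantics to be confirmed
docker run -d --rm --name gitbase -p 3306:3306 --link bblfshd:bblfshd -e BBLFSH_ENDPOINT=bblfshd:9432 -e GITBASE_UAST_CACHE_SIZE=128 -m 2g -v /path/to/repos:/opt/repos srcd/gitbase:latest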