gitbase: In-memory caching leads to crash
Issue
In the context of doing topic modeling experiments, @m09 and I tried to use Gitbase to parse all blobs in tagged references of a given repository, in order to extract all identifiers, comments and literals. However, we have not been able to use Gitbase successfully to do so, and have had to switch to doing the parsing client-side.
The reason for that is that, when querying Gitbase, we see the following behavior:
- An increase in memory usage.
- No decrease after time goes by.
- When all available memory is consumed, an increase in block I/O and a quasi-stagnation of the memory consumed by Gitbase at 99.999 … %, indicating heavy use of swap memory.
- Server crash if the query goes on for too long past that point.
We still see the same behavior when retrieving only the blob contents from Gitbase; however, the memory consumed is not an issue in that case, as it is much lower than when parsing UASTs. We inferred that some caching was going on, and after talking about the issue on the dev-processing channel, we tried to disable the caching - however it changed nothing. Javi told us that the caching we had disabled was the go-git cache, so it is probably something else.
What we don’t understand is why we cannot get rid of the behavior, i.e. why, once a blob has been parsed and returned client-side, it seemingly remains in memory.
Steps to reproduce
Launch gitbase and babelfish containers:
docker run -d --rm --name bblfshd --privileged -p 9432:9432 -m 4g bblfsh/bblfshd:v2.14.0-drivers
docker run -d --rm --name gitbase -p 3306:3306 --link bblfshd:bblfshd -e BBLFSH_ENDPOINT=bblfshd:9432 -m 2g -v /path/to/repos:/opt/repos srcd/gitbase:latest
With /path/to/repos pointing to a repository, for instance pytorch. Then open two more terminals to monitor what’s happening with docker stats, and run a query like the following one (here for pytorch), using for example the MySQL client:
SELECT
    cf.file_path,
    cf.blob_hash,
    LANGUAGE(cf.file_path) AS lang,
    uast_extract(uast(f.blob_content, LANGUAGE(cf.file_path), '//uast:String'), "Value")
FROM repositories r
NATURAL JOIN refs rf
NATURAL JOIN commit_files cf
NATURAL JOIN files f
WHERE r.repository_id = 'pytorch'
    AND is_tag(rf.ref_name)
    AND lang = 'Python'
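For the monitoring mentioned above, the plain docker stats command on the two container names used in the commands above is enough, run from the other terminals:
docker stats gitbase bblfshd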
You should see the memory usage of the gitbase container increase sharply until it hits 2 GB, then a heavy increase in block I/O, and finally the container crashing.
About this issue
- State: closed
- Created 5 years ago
- Reactions: 2
- Comments: 20 (13 by maintainers)
I just tested; for the query above it seems to be working. I saw on docker stats that the memory usage stagnates around 1.13 GB after hitting the limit 👍
For the query above, which requires some caching before returning the result: the memory stagnated at about 1.26 GB, but the query still finished. It did not increase even after subsequent queries. So yeah, looks good guys 😃
I just tested this with the gitbase image built from master, using the same setup described by @r0mainK but adding -e MAX_MEMORY=1024 to the gitbase container to limit the cache memory to 1 GiB, and it seems to work. The memory usage of the gitbase container did not exceed ~1.3 GiB, the block I/O did not go crazy and the query actually finished. It is not yet released, but you can take an early look if you need it, @r0mainK.
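For anyone wanting to reproduce this check, the gitbase command from the steps to reproduce with that variable added would look roughly like this - a sketch only, since the fix is not released yet the image tag below would have to be swapped for one built from master, and the value is in MiB per the comment above:
# same command as in the steps to reproduce, plus MAX_MEMORY; replace srcd/gitbase:latest with an image built from master
docker run -d --rm --name gitbase -p 3306:3306 --link bblfshd:bblfshd -e BBLFSH_ENDPOINT=bblfshd:9432 -e MAX_MEMORY=1024 -m 2g -v /path/to/repos:/opt/repos srcd/gitbase:latest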
It’s very hard to accurately ensure it does not go over the hard limit, so it’s more of a soft limit. But it should not be much bigger than the limit that has been set.
UPDATE: beta3 does not fix the error; we’re cutting an rc1 version with the complete fix.
Currently there’s PR #957 adding it to gitbase. When it’s merged and released, let’s give it a try to see if this is solved.
This will probably be fixed by https://github.com/src-d/gitbase/issues/929, so let’s mark it as blocked until there is a release containing that.
@r0mainK we have several LRU caches whose limits are based on the number of elements, not on the total size of those elements. We should have a look into that and find a more user-friendly way to homogenize how cache limits are set.
@ajnavarro unfortunately, still on the same example (Python files in Pytorch’s tagged references):
As you can see, the issue does not seem to be mitigated by using these additional clauses, as they don’t apply. Furthermore, the intent is to process all files from tagged references, not only the HEAD - hence the is_tag clause. Additionally, even when we try to split large queries into smaller ones in order to parse fewer blobs per query, for example parsing language by language or ref by ref, it ends up changing nothing, since the memory is not released.
EDIT: I had not seen your comment on GITBASE_UAST_CACHE_SIZE, I thought it would be restricted by setting the cache option when launching gitbase, gonna check if it works 😃
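Concretely, I’ll just relaunch the gitbase container from the steps above with that variable set, something like this (the value below is purely illustrative, and I’m assuming it counts cached UAST entries rather than bytes - to be confirmed):
# GITBASE_UAST_CACHE_SIZE=128 is an illustrative value, not a recommendation; semantics to be confirmed
docker run -d --rm --name gitbase -p 3306:3306 --link bblfshd:bblfshd -e BBLFSH_ENDPOINT=bblfshd:9432 -e GITBASE_UAST_CACHE_SIZE=128 -m 2g -v /path/to/repos:/opt/repos srcd/gitbase:latest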