spaCy: Memory leak with en_core_web_trf model

There is a memory leak when using pipe with the en_core_web_trf model. I run the model on a GPU with 16 GB of RAM. Here is a sample of the code.

!python -m spacy download en_core_web_trf

import en_core_web_trf
nlp = en_core_web_trf.load()

# data is just a list of 100K sentences
data = dataload()

for index, review in enumerate(nlp.pipe(data, batch_size=100)):
    # doing some processing on each doc here
    if index % 1000 == 0:
        print(index)

This code crashes when it reaches around 31K docs and raises an OOM error.

CUDA out of memory. Tried to allocate 46.00 MiB (GPU 0; 11.17 GiB total capacity; 10.44 GiB already allocated; 832.00 KiB free; 10.72 GiB reserved in total by PyTorch)

I only use the pipeline for prediction, not training. I have tried different batch sizes, but nothing changed; it still crashes.

Your Environment

  • spaCy version: 3.0.5
  • Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.10
  • Pipelines: en_core_web_trf (3.0.0)

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 7
  • Comments: 19 (3 by maintainers)

Most upvoted comments

Focusing on inference (not training) for this particular issue:

I can’t find any behavior that looks like a memory leak and the only way I can reproduce an out of memory error with en_core_web_trf is with a batch or doc* that is too long. I also checked on CPU with valgrind and couldn’t find any memory leaks.

*Correction based on #7268: it looks like long docs are more likely to cause problems than long batches of shorter texts. The same text split up into shorter texts in one single batch does not cause OOM errors even when the same text does as a single doc. Very long batches may still cause issues, of course.
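
For illustration only (not from the original thread): if you have one very long text, one workaround is to split it into smaller pieces before piping, so that no single doc is too long. A minimal sketch, where splitting on blank lines is just an assumption about how your data could be chunked:

# hypothetical sketch: split one very long text into paragraph-sized chunks
# so that no single doc sent through the pipeline is too long
with open("very_long_document.txt") as f:  # hypothetical input file
    long_text = f.read()
chunks = [p for p in long_text.split("\n\n") if p.strip()]

for doc in nlp.pipe(chunks, batch_size=8):
    # process each chunk as its own doc instead of one giant doc
    print(len(doc.ents))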


In the original report, the batch size of 100 may be too large given the text lengths. If a batch is too long, the error looks like this:

  File "/home/adriane/spacy/venv/spacy/lib/python3.6/site-packages/transformers/models/roberta/modeling_roberta.py", line 225, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: CUDA out of memory. Tried to allocate 1.71 GiB (GPU 0; 7.79 GiB total capacity; 2.96 GiB already allocated; 869.00 MiB free; 4.77 GiB reserved in total by PyTorch)

In this comment above, I think the issue is that you’re saving a list of docs, which each contain saved tensors as part of doc._.trf_data, and those tensors are stored on the GPU:

docs = list(nlp.pipe(texts))

In contrast, in a loop like this, the tensor data in doc._.trf_data is garbage collected at some point after the end of each iteration:

for doc in nlp.pipe(texts):
    assert doc.has_annotation("TAG")
    # do some other processing, but don't save the whole doc

The data saved in doc._.trf_data is required while the pipeline is running (the components that listen to the transformer reference these tensors), but after all the listening components have run, you don’t need to keep it unless you need it for further processing. One simple workaround is to add a final custom component that sets doc._.trf_data = None, which means the tensors will be garbage collected and freed. See: https://github.com/explosion/spaCy/discussions/7486#discussioncomment-512106
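
For illustration, a minimal sketch of such a component (the component name clear_trf_data is just a placeholder; see the linked discussion for the original version):

from spacy.language import Language

@Language.component("clear_trf_data")
def clear_trf_data(doc):
    # drop the transformer output once the listening components have used it,
    # so the GPU tensors can be garbage collected
    doc._.trf_data = None
    return doc

nlp.add_pipe("clear_trf_data", last=True)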

If you do want to store all the docs with the TransformerData, you could convert the tensors to numpy arrays on the CPU instead. I think the simplest way is something like this with .get():

doc._.trf_data.tensors = [x.get() for x in doc._.trf_data.tensors]

You can also add a final custom component that does this step if you want it to run as part of the pipeline.
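
A rough sketch of what that component could look like (the name trf_data_to_cpu is illustrative, and this assumes the tensors are cupy arrays on the GPU):

from spacy.language import Language

@Language.component("trf_data_to_cpu")
def trf_data_to_cpu(doc):
    # move the transformer tensors from GPU (cupy) to CPU (numpy) so that
    # stored docs no longer hold on to GPU memory
    doc._.trf_data.tensors = [t.get() for t in doc._.trf_data.tensors]
    return doc

nlp.add_pipe("trf_data_to_cpu", last=True)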


For reference, I tested with:

  • GeForce RTX 2070
  • CUDA 11.0
  • cupy-cuda110 8.6.0 (I also tried 9.0.0)
  • torch 1.6.0
  • transformers 4.5.0

Notes:

  • if I process the same short doc repeatedly, the GPU memory usage does not appear to change:
    • this means it doesn’t look like a memory leak related to processing in spacy
    • it could theoretically be related to a poorly managed cache, but as far as I can tell there are no caches involved on the GPU side of things
  • at startup, the reported GPU memory usage increases each time a batch longer than any previously processed batch is processed, and then when the GPU memory is almost full, memory gets freed automatically by pytorch
  • there are old bug fixes in cupy related to dlpack that fixed memory leaks, but in versions earlier than those supported by spacy (fixed in v5.0.0b3)
  • you can call torch.cuda.empty_cache() and cupy.get_default_memory_pool().free_all_blocks() to free memory manually at an earlier point than it would be freed automatically, but it shouldn’t be necessary (see the sketch after this list)
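
As a hedged sketch (not from the original thread) of what calling these manually might look like inside a prediction loop; flushing every 1000 docs is an arbitrary choice:

import cupy
import torch

for index, doc in enumerate(nlp.pipe(data, batch_size=100)):
    # ... do the usual per-doc processing here ...
    if index % 1000 == 0:
        # manually release cached GPU blocks held by torch and cupy
        torch.cuda.empty_cache()
        cupy.get_default_memory_pool().free_all_blocks()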

If you’re still running into this problem, could you include additional details about the versions of the libraries where you see it (CUDA, cupy, torch, transformers, thinc, spacy) and the exact code that you’re running?

Thanks, that’s helpful to see.

Can you try configuring the model to periodically flush the pytorch cache? That’s the most obvious built-in option that might help. It’s not enabled by default, the comments in the code say it shouldn’t be necessary, and it doesn’t look like we need to do this while training en_core_web_trf, so I’m very uncertain whether it will help. But just to see, try setting this to something between 0 and 1:

nlp.get_pipe("transformer").model.attrs["flush_cache_chance"] = 0.1

0.1 could well be too high, but I’m not sure what value makes sense. This means that randomly 10% of the time in the forward pass there’s an additional call to torch.cuda.empty_cache().

This setting isn’t saved with the model, so if it does help, there’s some room for improvement here. Either way, it would be interesting to hear whether it makes a difference.

It might also be helpful to see if a simpler pipeline like only ['transformer', 'tagger'] runs into the same problem?
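
For example (a sketch that assumes the usual en_core_web_trf component names), the other components can be disabled at load time:

import spacy

# keep only the transformer and tagger; the disabled component names below
# are the usual ones in en_core_web_trf
nlp = spacy.load(
    "en_core_web_trf",
    disable=["parser", "attribute_ruler", "lemmatizer", "ner"],
)
print(nlp.pipe_names)  # should show only ['transformer', 'tagger']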

I’ve replicated the problem with long documents locally but haven’t tried to replicate this yet myself…

@nikjohn7: Memory usage while training is a separate issue. This issue is focused on prediction/inference only.

Could you open a new discussion thread with all the details about your training setup? Unfortunately I don’t see an easy way for me to convert your original comment into a discussion thread…

@adrianeboyd I’m also having the same issue with the en_core_web_trf model. It works fine on my smaller dataset (12k) but gives the OOM error when I try my 60k dataset. I’m using spacy 3 with config files and have set the example size and batch size to 50, but it is still not working. See my config file below. I am running the model on a GPU with 16 GiB of memory.

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 50
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
source = "en_core_web_trf"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}

[components.transformer]
source = "en_core_web_trf"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "roberta-base"

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.tokenizer_config]
use_fast = true

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 50
gold_preproc = false
limit = 0
augmenter = null

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 50
buffer = 50
get_length = null

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005

[training.score_weights]
ents_per_type = null
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0

[pretraining]

[initialize]
vectors = null
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

Thank you for your suggestion @adrianeboyd. I tried torch.cuda.empty_cache(), but I found that the GPU memory wasn’t affected; some things still occupied a place in memory, which doesn’t make sense because only the model should be loaded in memory and the pipeline is used just to predict. I tried 1K of my data and it succeeded, but the GPU memory wasn’t freed even after deleting the model and clearing the cache; you have to restart the interpreter to get free GPU memory.

This behavior seems to come from having one very long doc. The batch size currently sets the number of docs to process in a batch, but individual docs aren’t split up in any way if they’re very long. Can you check whether there’s a particularly long doc at some point in your data?
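
One quick way to check (a sketch, assuming the input texts are in a list of strings called texts) is to look at the character lengths of the longest inputs before piping them:

# print the indices and character lengths of the five longest texts
lengths = sorted((len(text), i) for i, text in enumerate(texts))
for length, i in lengths[-5:]:
    print(i, length)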