transformers: possible GPU memory leak problem
I wrap 'BertModel' as a persistent object, initialize it once, and then call it repeatedly as a feature extractor to generate features for each data batch, but I seem to be running into a GPU memory leak. After the program starts, GPU memory usage keeps increasing until it runs out of memory. The key code is below. Every time 'self.bert_model.get_bert_feature()' executes, GPU memory grows. From some simple debugging, the problem may be caused by 'BertEmbeddings.forward()'. My PyTorch version is 0.4.0 on Python 3. Waiting for your reply, thanks very much!
class BertModel(PreTrainedBertModel):
    def __init__(self, config):
        super(BertModel, self).__init__(config)
        self.embeddings = BertEmbeddings(config)
        self.encoder = BertEncoder(config)
        self.pooler = BertPooler(config)
        self.apply(self.init_bert_weights)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, output_all_encoded_layers=False):
        # logger.info('bert forward')
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        # We create a 3D attention mask from a 2D tensor mask.
        # Sizes are [batch_size, 1, 1, to_seq_length]
        # so we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length].
        # This attention mask is simpler than the triangular masking of causal attention
        # used in OpenAI GPT; we just need to prepare the broadcast dimension here.
        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
        # Since attention_mask is 1.0 for positions we want to attend to and 0.0 for
        # masked positions, this operation creates a tensor which is 0.0 for
        # positions we want to attend to and -10000.0 for masked positions.
        # Since we add it to the raw scores before the softmax, this is
        # effectively the same as removing them entirely.
        extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
        embedding_output = self.embeddings(input_ids, token_type_ids)
        encoded_layers = self.encoder(embedding_output,
                                      extended_attention_mask,
                                      output_all_encoded_layers=output_all_encoded_layers)
        return encoded_layers
class Bert_Instance(object):
    def __init__(self, vocab_file, bert_model_path, device):
        # tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
        self.tokenizer = BertTokenizer(vocab_file)
        self.model = BertModel.from_pretrained(bert_model_path)
        self.device = device
        print('bert_device=', self.device)
        self.model.to(self.device)
        self.model.eval()
        for para in self.model.parameters():
            para.requires_grad = False

    def get_feature(self, text_list, max_seq_length=50, layer=-1):
        '''
        Args:
            text_list is a list that stores the sentences; its length is the sentence number
        Return:
            (batch_size, seq_len+2, hidden_size)
        '''
        # a list; each dict element has keys (ex_index, tokens, input_ids, input_mask, input_type_ids)
        all_features = convert_examples_to_features(examples=text_list,
                                                    max_seq_length=max_seq_length,
                                                    tokenizer=self.tokenizer)
        all_input_ids = torch.tensor([f['input_ids'] for f in all_features]).type(torch.cuda.LongTensor).to(self.device)
        all_input_mask = torch.tensor([f['input_mask'] for f in all_features]).type(torch.cuda.LongTensor).to(self.device)
        all_encoder_layers = self.model(all_input_ids,
                                        token_type_ids=None,
                                        attention_mask=all_input_mask)
        return all_encoder_layers, all_input_mask
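For reference, a minimal sketch of how the forward pass in 'get_feature' could be wrapped in the 'torch.no_grad()' context manager (available since PyTorch 0.4.0), which guarantees that no autograd graph is retained for the returned activations, independent of the parameters' 'requires_grad' flags. This is a sketch of a drop-in replacement for the method above, not the original code:

import torch

# drop-in replacement for Bert_Instance.get_feature (sketch only)
def get_feature(self, text_list, max_seq_length=50, layer=-1):
    all_features = convert_examples_to_features(examples=text_list,
                                                max_seq_length=max_seq_length,
                                                tokenizer=self.tokenizer)
    all_input_ids = torch.tensor([f['input_ids'] for f in all_features],
                                 dtype=torch.long, device=self.device)
    all_input_mask = torch.tensor([f['input_mask'] for f in all_features],
                                  dtype=torch.long, device=self.device)
    with torch.no_grad():
        # no graph is built or kept for anything computed inside this block
        all_encoder_layers = self.model(all_input_ids,
                                        token_type_ids=None,
                                        attention_mask=all_input_mask)
    return all_encoder_layers, all_input_mask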
class Bert_Model(object):
    def __init__(self, device):
        self.bert_model = Bert_Instance(BERT_VOCAB, BERT_MODEL, device)
        self.device = device
        self.zp_pre_cache = {}
        self.zp_post_cache = {}
        self.candi_np = {}
        self.cache = {'zp_pre': self.zp_pre_cache,
                      'zp_post': self.zp_post_cache,
                      'candi_np': self.candi_np}

    def get_bert_feature(self, text_list, cache_name, batch_id, max_seq_length=30, layer=-1):
        if batch_id in self.cache[cache_name].keys():
            # res = torch.tensor(self.cache[cache_name][batch_id]).type(torch.cuda.FloatTensor).to(self.device)
            res = self.cache[cache_name][batch_id]
            return res
        else:
            res = self.bert_model.get_feature(text_list, max_seq_length, layer)
            self.cache[cache_name][batch_id] = res
            return res
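One likely contributor to the growth: 'get_bert_feature' stores the GPU tensors returned for every new 'batch_id', so GPU memory grows with every distinct batch seen. A hedged sketch of an alternative (assuming 'get_feature' returns a (features, mask) pair of tensors, as the docstring above suggests) that keeps the cache on the CPU and copies it back to the GPU only on a hit:

# drop-in replacement for Bert_Model.get_bert_feature (sketch only)
def get_bert_feature(self, text_list, cache_name, batch_id, max_seq_length=30, layer=-1):
    if batch_id in self.cache[cache_name]:
        feats, mask = self.cache[cache_name][batch_id]
        # copy the cached CPU tensors back to the GPU only for this call
        return feats.to(self.device), mask.to(self.device)
    feats, mask = self.bert_model.get_feature(text_list, max_seq_length, layer)
    # detach() drops any autograd history; cpu() keeps the stored copy off the GPU
    self.cache[cache_name][batch_id] = (feats.detach().cpu(), mask.detach().cpu())
    return feats, mask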
class Experiment(object):
    def __init__(self):
        # load training data
        with open(DIR + "data/train_data", "rb") as fin1, \
             open(DIR + "data/emb", "rb") as fin2:
            self.train_generator = cPickle.load(fin1)
            self.embedding_matrix, _, _ = cPickle.load(fin2, encoding='iso-8859-1')
        # load test data
        self.test_generator = DataGenerator("test", 256)
        self.dev_data = self.train_generator.generate_dev_data()
        self.test_data = self.test_generator.generate_data()
        # declare model architecture
        self.model = Network(nnargs["embedding_size"], nnargs["embedding_dimension"], self.embedding_matrix, nnargs["hidden_dimension"], 2).to(NET_DEVICE)
        self.bert_model = Bert_Model(BERT_DEVICE)
        this_lr = 0.003
        self.optimizer = optim.Adagrad(self.model.parameters(), lr=this_lr)
        self.best = {"sum": 0.0, "test_f": 0.0, "best_test_f": 0.0}
        self.dropout = nnargs["dropout"]

    def forward_step(self, data, mode, dropout=0.0):
        zp_relative_index, zp_pre, zp_pre_mask, zp_post, zp_post_mask, candi_np, candi_np_mask, feature, zp_pre_words, zp_post_words, candi_np_words, batch_id = data2tensor(data)
        batch_id = mode + '_' + str(batch_id)
        zp_pre_bert, _ = self.bert_model.get_bert_feature(zp_pre_words, 'zp_pre', batch_id)
        zp_post_bert, _ = self.bert_model.get_bert_feature(zp_post_words, 'zp_post', batch_id)
        candi_np_bert, _ = self.bert_model.get_bert_feature(candi_np_words, 'candi_np', batch_id)
        .....
About this issue
- State: closed
- Created 5 years ago
- Comments: 18 (4 by maintainers)
Maybe use the torch.no_grad() context manager, which is now the recommended way to perform inference with PyTorch? See https://pytorch.org/docs/stable/autograd.html#torch.autograd.no_grad

I have the newest version of pytorch and transformers, yes.
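In minimal form, the pattern suggested above is just the following ('model', 'input_ids' and 'attention_mask' are placeholder names):

with torch.no_grad():
    # gradient tracking is disabled inside this block, so the outputs carry
    # no autograd graph and nothing accumulates across repeated calls
    output = model(input_ids, attention_mask=attention_mask)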
I have been monitoring the memory usage over 24h during which I made ~300,000 requests. The memory increases constantly for quite some time but also seems to stabilize at a certain maximum: the application started at ~2.5 GB RAM and now stays at ~4.3 GB.
Maybe it has something to do with the varying lengths of the texts I process? The longest texts might be processed at a later point in time and require the most RAM; any subsequent text then cannot require more, so usage stabilizes. Though this is just a thought.
Thanks already for your help. I'm off on Christmas vacation for now and will look at the issue again in January. I'll check whether memory usage has increased by then.
So I tried it with bert-base-multilingual-uncased as well, and the behavior is the same. I do not understand why memory constantly grows during inference. To my understanding, I only push data through the network and then use the output of the result layer. Before switching to transformers, I used custom word embeddings trained in my own Keras models and did not see this behavior. What am I missing here?
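One way to narrow this down is to log, around each inference call, how much memory is held by live CUDA tensors versus the process as a whole. A small sketch (psutil is a third-party package and an assumption here, not something used in the thread):

import os
import psutil
import torch

def log_memory(tag=''):
    # memory currently occupied by live CUDA tensors, and the peak so far
    alloc = torch.cuda.memory_allocated() / 1024 ** 2
    peak = torch.cuda.max_memory_allocated() / 1024 ** 2
    # host-side resident set size of this process
    rss = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
    print('%s allocated=%.0fMiB peak=%.0fMiB rss=%.0fMiB' % (tag, alloc, peak, rss))

Calling log_memory() before and after each batch should show whether the growth comes from live tensors on the GPU (e.g. cached outputs or retained graphs) or only from host RSS, which tends to plateau when it is allocator or fragmentation behavior.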