transformers: possible GPU memory leak problem
I wrap 'BertModel' as a persistent object, initialize it once, and then call it repeatedly as a feature extractor to generate features for each data batch, but I seem to be running into a GPU memory leak. After the program starts, GPU memory usage keeps increasing until it runs out of memory. The key code is below. Every time 'self.bert_model.get_bert_feature()' executes, GPU memory grows. From some simple debugging, the problem may be caused by 'BertEmbeddings.forward()'. My PyTorch version is 0.4.0 on Python 3. Waiting for your reply, thanks very much!
class BertModel(PreTrainedBertModel):
    def __init__(self, config):
        super(BertModel, self).__init__(config)
        self.embeddings = BertEmbeddings(config)
        self.encoder = BertEncoder(config)
        self.pooler = BertPooler(config)
        self.apply(self.init_bert_weights)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, output_all_encoded_layers=False):
        # logger.info('bert forward')
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        # We create a 3D attention mask from a 2D tensor mask.
        # Sizes are [batch_size, 1, 1, to_seq_length]
        # so we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length].
        # This attention mask is simpler than the triangular masking of causal attention
        # used in OpenAI GPT; we just need to prepare the broadcast dimension here.
        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
        # Since attention_mask is 1.0 for positions we want to attend to and 0.0 for
        # masked positions, this operation creates a tensor which is 0.0 for
        # positions we want to attend to and -10000.0 for masked positions.
        # Since we add it to the raw scores before the softmax, this is
        # effectively the same as removing them entirely.
        extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
        embedding_output = self.embeddings(input_ids, token_type_ids)
        encoded_layers = self.encoder(embedding_output,
                                      extended_attention_mask,
                                      output_all_encoded_layers=output_all_encoded_layers)
        return encoded_layers
class Bert_Instance(object):
    def __init__(self, vocab_file, bert_model_path, device):
        # tokenizer = BertTokenizer.from_pretrained(args.bert_model, do_lower_case=args.do_lower_case)
        self.tokenizer = BertTokenizer(vocab_file)
        self.model = BertModel.from_pretrained(bert_model_path)
        self.device = device
        print('bert_device=', self.device)
        self.model.to(self.device)
        self.model.eval()
        for para in self.model.parameters():
            para.requires_grad = False

    def get_feature(self, text_list, max_seq_length=50, layer=-1):
        '''
        Args:
            text_list is a list that stores the sentences; its length is the sentence number
        Return:
            (batch_size, seq_len+2, hidden_size)
        '''
        # a list; each dict element has keys (ex_index, tokens, input_ids, input_mask, input_type_ids)
        all_features = convert_examples_to_features(examples=text_list,
                                                    max_seq_length=max_seq_length,
                                                    tokenizer=self.tokenizer)
        all_input_ids = torch.tensor([f['input_ids'] for f in all_features]).type(torch.cuda.LongTensor).to(self.device)
        all_input_mask = torch.tensor([f['input_mask'] for f in all_features]).type(torch.cuda.LongTensor).to(self.device)
        all_encoder_layers = self.model(all_input_ids,
                                        token_type_ids=None,
                                        attention_mask=all_input_mask)
        return all_encoder_layers, all_input_mask
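For reference, a minimal sketch of how the forward pass in 'get_feature' could be wrapped in the 'torch.no_grad()' context manager (available since PyTorch 0.4.0), which guarantees that no autograd graph is retained for the returned activations, independent of the parameters' 'requires_grad' flags. This is a sketch of a drop-in replacement for the method above, not the original code:

import torch

# drop-in replacement for Bert_Instance.get_feature (sketch only)
def get_feature(self, text_list, max_seq_length=50, layer=-1):
    all_features = convert_examples_to_features(examples=text_list,
                                                max_seq_length=max_seq_length,
                                                tokenizer=self.tokenizer)
    all_input_ids = torch.tensor([f['input_ids'] for f in all_features],
                                 dtype=torch.long, device=self.device)
    all_input_mask = torch.tensor([f['input_mask'] for f in all_features],
                                  dtype=torch.long, device=self.device)
    with torch.no_grad():
        # no graph is built or kept for anything computed inside this block
        all_encoder_layers = self.model(all_input_ids,
                                        token_type_ids=None,
                                        attention_mask=all_input_mask)
    return all_encoder_layers, all_input_mask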
class Bert_Model(object):
    def __init__(self, device):
        self.bert_model = Bert_Instance(BERT_VOCAB, BERT_MODEL, device)
        self.device = device
        self.zp_pre_cache = {}
        self.zp_post_cache = {}
        self.candi_np = {}
        self.cache = {'zp_pre': self.zp_pre_cache,
                      'zp_post': self.zp_post_cache,
                      'candi_np': self.candi_np}

    def get_bert_feature(self, text_list, cache_name, batch_id, max_seq_length=30, layer=-1):
        if batch_id in self.cache[cache_name].keys():
            # res = torch.tensor(self.cache[cache_name][batch_id]).type(torch.cuda.FloatTensor).to(self.device)
            res = self.cache[cache_name][batch_id]
            return res
        else:
            res = self.bert_model.get_feature(text_list, max_seq_length, layer)
            self.cache[cache_name][batch_id] = res
            return res
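One likely contributor to the growth: 'get_bert_feature' stores the GPU tensors returned for every new 'batch_id', so GPU memory grows with every distinct batch seen. A hedged sketch of an alternative (assuming 'get_feature' returns a (features, mask) pair of tensors, as the docstring above suggests) that keeps the cache on the CPU and copies it back to the GPU only on a hit:

# drop-in replacement for Bert_Model.get_bert_feature (sketch only)
def get_bert_feature(self, text_list, cache_name, batch_id, max_seq_length=30, layer=-1):
    if batch_id in self.cache[cache_name]:
        feats, mask = self.cache[cache_name][batch_id]
        # copy the cached CPU tensors back to the GPU only for this call
        return feats.to(self.device), mask.to(self.device)
    feats, mask = self.bert_model.get_feature(text_list, max_seq_length, layer)
    # detach() drops any autograd history; cpu() keeps the stored copy off the GPU
    self.cache[cache_name][batch_id] = (feats.detach().cpu(), mask.detach().cpu())
    return feats, mask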
class Experiment(object):
    def __init__(self):
        # load training data
        with open(DIR + "data/train_data", "rb") as fin1, \
             open(DIR + "data/emb", "rb") as fin2:
            self.train_generator = cPickle.load(fin1)
            self.embedding_matrix, _, _ = cPickle.load(fin2, encoding='iso-8859-1')
        # load test data
        self.test_generator = DataGenerator("test", 256)
        self.dev_data = self.train_generator.generate_dev_data()
        self.test_data = self.test_generator.generate_data()
        # declare model architecture
        self.model = Network(nnargs["embedding_size"], nnargs["embedding_dimension"], self.embedding_matrix, nnargs["hidden_dimension"], 2).to(NET_DEVICE)
        self.bert_model = Bert_Model(BERT_DEVICE)
        this_lr = 0.003
        self.optimizer = optim.Adagrad(self.model.parameters(), lr=this_lr)
        self.best = {"sum": 0.0, "test_f": 0.0, "best_test_f": 0.0}
        self.dropout = nnargs["dropout"]

    def forward_step(self, data, mode, dropout=0.0):
        zp_relative_index, zp_pre, zp_pre_mask, zp_post, zp_post_mask, candi_np, candi_np_mask, feature, zp_pre_words, zp_post_words, candi_np_words, batch_id = data2tensor(data)
        batch_id = mode + '_' + str(batch_id)
        zp_pre_bert, _ = self.bert_model.get_bert_feature(zp_pre_words, 'zp_pre', batch_id)
        zp_post_bert, _ = self.bert_model.get_bert_feature(zp_post_words, 'zp_post', batch_id)
        candi_np_bert, _ = self.bert_model.get_bert_feature(candi_np_words, 'candi_np', batch_id)
        .....
About this issue
- State: closed
- Created 5 years ago
- Comments: 18 (4 by maintainers)
Maybe use the torch.no_grad() context manager, which is now the recommended way to perform inference with PyTorch? See https://pytorch.org/docs/stable/autograd.html#torch.autograd.no_grad

I have the newest version of pytorch and transformers, yes.
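In minimal form, the pattern suggested above is just the following ('model', 'input_ids' and 'attention_mask' are placeholder names):

with torch.no_grad():
    # gradient tracking is disabled inside this block, so the outputs carry
    # no autograd graph and nothing accumulates across repeated calls
    output = model(input_ids, attention_mask=attention_mask)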
I have been monitoring the memory usage over 24h during which I made ~300,000 requests. The memory increases constantly for quite some time but also seems to stabilize at a certain maximum: the application started at ~2.5 GB RAM and now stays at ~4.3 GB.
Maybe it has something to do with the varying lengths of the texts I process? The longest texts might be processed at a later point in time and require the most RAM; any subsequent text then cannot require more, so usage stabilizes. Though this is just a thought.
Thanks already for your help. I'm off on Christmas vacation for now and will look at the issue again in January. I'll check whether memory usage has increased by then.
So I tried it with bert-base-multilingual-uncased as well, and the behavior is the same. I do not understand why memory constantly grows during inference. To my understanding, I only push data through the network and then use the output of the result layer. Before switching to transformers, I used custom word embeddings trained in my own Keras models and did not see this behavior. What am I missing here?
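One way to narrow this down is to log, around each inference call, how much memory is held by live CUDA tensors versus the process as a whole. A small sketch (psutil is a third-party package and an assumption here, not something used in the thread):

import os
import psutil
import torch

def log_memory(tag=''):
    # memory currently occupied by live CUDA tensors, and the peak so far
    alloc = torch.cuda.memory_allocated() / 1024 ** 2
    peak = torch.cuda.max_memory_allocated() / 1024 ** 2
    # host-side resident set size of this process
    rss = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
    print('%s allocated=%.0fMiB peak=%.0fMiB rss=%.0fMiB' % (tag, alloc, peak, rss))

Calling log_memory() before and after each batch should show whether the growth comes from live tensors on the GPU (e.g. cached outputs or retained graphs) or only from host RSS, which tends to plateau when it is allocator or fragmentation behavior.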