MidiTok: ValueError: invalid literal for int() with base 10: '3.6.8' OR ValueError: not enough values to unpack (expected 2, got 1)

First of all, the framework has already been very useful!

I am getting two kinds of errors and don’t know why. I use the GPT-2 architecture from the repository’s example notebook (successfully trained) and MidiTok 1.1.9.

Code structure

Encoding:

pitch_range = range(21, 109)
beat_res = {(0, 4): 8}
nb_velocities = 32
additional_tokens = {'Chord': False, 'Rest': False, 'Tempo': True, 'Program': True, 'TimeSignature': True,
                     'nb_tempos': 32,
                     'tempo_range': (40, 250),
                     'time_signature_range': (8, 2)}
tokenizer = Octuple(pitch_range, beat_res, nb_velocities, additional_tokens)

Preprocessing:

# Converts MIDI files to tokens saved as JSON files
tokenizer.tokenize_midi_dataset(paths, relative_path_to_json, midi_valid)

json_paths = list(path.Path(relative_path_to_json).glob('*.json'))
entire_pop909_json_with_bools = []

for json_file in json_paths:
    with open(json_file) as f:
        data = json.load(f)
        entire_pop909_json_with_bools.extend(data) # where elements are found in the list of lists

entire_pop909_json_list = []
# keep only the song tokens, not the boolean track flags
for slot in entire_pop909_json_with_bools:
    if False not in slot[0]: # TAKE CARE: just for Pop909 dataset
        entire_pop909_json_list.append(slot)

flatten_different_songs = [item for sublist in entire_pop909_json_list for item in sublist]
# just trying to make the token units fit the [4, 1024] shape, otherwise it would be [4, 1024, 8]
flatten_time_steps = [item for sublist in flatten_different_songs for item in sublist]

train_data = []
train_data.extend(flatten_time_steps)

Output tensors shape from DataLoader:

Train loader
X shape: torch.Size([4, 1024])
Target shape: torch.Size([4, 1024])

Generating from scratch:

rand_seq = model.generate(torch.Tensor([1]), target_seq_length=512)
out = rand_seq[0].cpu().numpy().tolist()

converted_back_midi = tokenizer.tokens_to_midi([out], None)
converted_back_midi.dump('output.mid')

Errors

When the generation part is executed, one of two kinds of errors can appear. This one:

MidiTok Model Generator
Generating sequence of max length: 512
50 / 512
100 / 512
150 / 512
200 / 512
250 / 512
300 / 512
350 / 512
400 / 512
450 / 512
500 / 512

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_5234/3425966451.py in <module>
     14 out = rand_seq[0].cpu().numpy().tolist()
     15 
---> 16 converted_back_midi = tokenizer.tokens_to_midi([out], None)
     17 converted_back_midi.dump('4_model_1_OUTPUT(256).mid')
     18 

~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/octuple.py in tokens_to_midi(self, tokens, _, output_path, time_division)
    230 
    231         if self.additional_tokens['TimeSignature']:
--> 232             time_sig = self._parse_token_time_signature(self.tokens_to_events(tokens[0])[-1].value)
    233         else:  # default
    234             time_sig = TIME_SIGNATURE

~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/midi_tokenizer_base.py in _parse_token_time_signature(token_time_sig)
    447         :return: the numerator and denominator of a time signature
    448         """
--> 449         numerator, denominator = map(int, token_time_sig.split('/'))
    450         return numerator, denominator
    451 

ValueError: invalid literal for int() with base 10: '3.6.8'

Or this one:

MidiTok Model Generator
Generating sequence of max length: 512
50 / 512
100 / 512
150 / 512
200 / 512
250 / 512
300 / 512
350 / 512
400 / 512
450 / 512
500 / 512

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_5234/761086941.py in <module>
     14 out = rand_seq[0].cpu().numpy().tolist()
     15 
---> 16 converted_back_midi = tokenizer.tokens_to_midi([out], None)
     17 converted_back_midi.dump('output.mid')
     18 

~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/octuple.py in tokens_to_midi(self, tokens, _, output_path, time_division)
    230 
    231         if self.additional_tokens['TimeSignature']:
--> 232             time_sig = self._parse_token_time_signature(self.tokens_to_events(tokens[0])[-1].value)
    233         else:  # default
    234             time_sig = TIME_SIGNATURE

~/miniconda3/envs/remiTest/lib/python3.9/site-packages/miditok/midi_tokenizer_base.py in _parse_token_time_signature(token_time_sig)
    447         :return: the numerator and denominator of a time signature
    448         """
--> 449         numerator, denominator = map(int, token_time_sig.split('/'))
    450         return numerator, denominator
    451 

ValueError: not enough values to unpack (expected 2, got 1)

In the ValueError: invalid literal for int() with base 10: '3.6.8' case, the ‘x.x.x’ literal (here '3.6.8') can change on every execution.

Thanks in advance!

PS: Sorry if I made this too long, I just wanted to be clear on each point 😃.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 24 (12 by maintainers)

Most upvoted comments

Hi Env,

After a few tests I did not run into bugs, so I released the update in v1.2.0! If you get new bugs / crashes, please re-open this issue or create a new one! 😃

BTW, Octuple is pretty “demanding” in compute resources, meaning the multi input / output requires a relatively high number of model parameters (and therefore GPU memory). The original authors used 8 V100s (32 GB VRAM), which is quite a lot. My results with one V100 weren’t very good either, with the model often producing errors such as predicting Bars / Positions already passed (going backward in time). For smaller hardware / model sizes, representations like REMI / Structured are more suitable.
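
For example, switching to REMI only changes the tokenizer creation; roughly something like this (an untested sketch, reusing the settings from your first message, some of which REMI may not support):

from miditok import REMI
from miditoolkit import MidiFile

# rough, untested sketch: REMI produces one flat (1D) token sequence per track,
# so a standard GPT-2-style model with a single embedding layer can be used directly
remi_tokenizer = REMI(pitch_range, beat_res, nb_velocities, additional_tokens)
midi = MidiFile('path/to/a_midi_file.mid')  # hypothetical path
tokens = remi_tokenizer.midi_to_tokens(midi)  # list of 1D token sequences, one per track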

Amazing!

I will try to train for many epochs on an Amazon GPU, and when I have results I will tell you. The new version seems to work well!

Hi 😃

I just corrected it in 960cbfa8eac1750aec1fb95d623e3ab2a51370f1 (a really stupid bug ahah). Hoping not to find any other bugs; if you find one, please tell me. And if, after testing with generated tokens, you don’t encounter any other bug, please also report that so I can release this in the next version. 😃

I get what you are saying. Maybe that’s deeper to build than I can manage, but I will try my best these days 😄

I’ll report everything I find 👍

You were right. I used your transformer class and Octuple with some changes, and it seems to work now. The only problem is that I am struggling with the predict function: how would you use it to generate a sequence without a primer melody?

PS: Should I open another issue for this? Has this one drifted too far?

Yes, I am 99% confident this error was caused by the “flattening”.

If this takes you too much time, maybe you could just switch to a 1D representation (REMI, Structured, etc.). I currently have things running; when they’re done I’ll try this version of Octuple.

If this can help you, here is how to compute the several losses:

# x is the input sequence, of shape (N,T,Z): T is the sequence length, N the batch size, Z the number of token types
# y is the output logits: a list of Z tensors of shape (T,N,C*), where C* is the vocabulary size and varies with the token type (pitch, velocity, etc.)
losses = []
for j in range(len(tokenizer.vocab)):
    losses.append(criterion(y[j].permute(1, 2, 0), x[..., j]))  # shapes (N,C,T) and (N,T), see Pytorch cross-entropy for details
loss = sum(losses)  # here we sum, but we could also take the mean, for instance
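
And here is a rough, untested sketch of how criterion could be defined and plugged into a training step (assuming 0 is the padding index and that model is the multi input / output transformer from my other comment):

import torch
from torch.nn import CrossEntropyLoss

criterion = CrossEntropyLoss(ignore_index=0)  # assuming 0 is the padding index of every vocabulary
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# x: (N,T,Z) LongTensor of (non-flattened) Octuple token indices
y = model(x.permute(1, 0, 2), causal=True)  # list of Z tensors of shape (T,N,C*)
losses = [criterion(y[j].permute(1, 2, 0), x[..., j]) for j in range(len(tokenizer.vocab))]
loss = sum(losses)

optimizer.zero_grad()
loss.backward()
optimizer.step()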

Thank you!

By tokens I am referring to a token sequence produced by the model (a list of lists of integers in the case of Octuple).

I looked at the GPT2Model from Hugging Face, and the problem (for us here) is that it automatically comes with an embedding layer, so it can’t be used with multiple embeddings.

But if you are using PyTorch, the Transformer module is almost exactly the same. Here is how to create the model, with multi input / output modules for Octuple (I did not test it exactly as written here, I just assembled it from code blocks I had):

from typing import Optional, List
from math import log

import torch
from torch.nn import Module, Linear, Embedding, ModuleList, Dropout, TransformerEncoder, TransformerEncoderLayer
from torch.nn.init import xavier_uniform_
from torch import Tensor, cat, no_grad, triu, ones, stack


class MyTransformer(Module):
    def __init__(self, num_layers: int, num_classes: List[int], d_model: int, nhead: int,
                 dim_feedforward: int, max_seq_len: int, embedding_sizes: List[int] = None,
                 dropout: float = 0.1, layer_norm_eps: float = 1e-5, device: torch.device = torch.device('cpu'),
                 padding_token: int = 0):
        super().__init__()
        head_dim, rest = divmod(d_model, nhead)
        assert rest == 0, f'Invalid combination of model dimension ({d_model}) and number of heads ({nhead})'
        self.device = device

        # POSITIONAL ENCODING
        self.pos_enc = AbsolutePositionalEncoding(d_model, max_seq_len)

        # Input module
        self.embedder = MultiEmbeddings(num_classes, embedding_sizes, d_model, padding_token)

        # Transformer (encoder-only stack, used autoregressively with a causal mask)
        encoder_layer = TransformerEncoderLayer(d_model=d_model, nhead=nhead, dim_feedforward=dim_feedforward,
                                                dropout=dropout, layer_norm_eps=layer_norm_eps,
                                                device=self.device)
        self.transformer = TransformerEncoder(encoder_layer, num_layers)

        # Output module
        self.to_logits = MultiOutput(num_classes, d_model)

        # INITIALIZATION
        for p in self.parameters():
            if p.dim() > 1:
                xavier_uniform_(p)
        self.to(self.device)

    def forward(self, tgt: Tensor, attn_mask: Optional[Tensor] = None, key_pad_mask: Optional[Tensor] = None,
                causal: bool = False):
        """

        :param tgt:
        :param attn_mask:
        :param key_pad_mask:
        :param causal: causal attention, will quickly compute attention with causality
        :return:
        """
        if attn_mask is None and causal:
            tgt_len = tgt.shape[0]
            attn_mask = triu(ones(tgt_len, tgt_len) * float('-inf'), diagonal=1).to(self.device)  # CAUSAL MASK

        tgt = self.embedder(tgt)  # (T,N,Z) -> (T,N,E)
        tgt = self.pos_enc(tgt)
        tgt = self.transformer(tgt, mask=attn_mask, src_key_padding_mask=key_pad_mask)
        tgt = self.to_logits(tgt)  # (T,N,E) -> list of (T,N,C), C is variable and depends on vocab sizes
        return tgt

    @no_grad()
    def predict(self, x: Tensor, inference_lim: int, max_seq_len: int, top_k: int) -> Tensor:
        """ Prediction function for inference

        :param x: input tensor (N,T,Z) if multi-input embedding
        :param inference_lim: number of inferences
        :param max_seq_len: maximum sequence length being process (attention context size)
        :param top_k: top k sampling value
        :return: the predicted sequence
        """
        x = x.transpose(1, 0).to(self.device)  # (N,T,) --> (T,N,)

        try:
            for _ in range(inference_lim):
                # Adds the prediction to the target sequence, updates the time values
                y = self.forward(x[-max_seq_len:], causal=True)  # list of Z (T,N,C*)
                y = stack([top_k_sampling(type_[-1], top_k) for type_ in y]).t()  # (N,Z)
                x = cat([x, y.unsqueeze(0)])  # (T+1,N,Z)
        except (KeyError, IndexError):  # bar embedding index too high
            pass
        return x.transpose(1, 0)  # (N,T,)


class MultiEmbeddings(Module):
    """Multi-input module, taking several tokens as input, converting them to embeddings and
    concatenate them to make a single 'merged' embedding

    :param num_classes: number of classes for each token type
    :param embedding_sizes: sizes of each embedding type
    :param d_model: size of the final embedding, i.e. dimension of the transformer
    :param padding_idx: padding index, must be the same for each token type
    """
    def __init__(self, num_classes: List[int], embedding_sizes: List[int], d_model: int, padding_idx: int = 0):
        assert len(num_classes) == len(embedding_sizes), \
            f'The number of classes and embedding sizes must be the same ({len(num_classes)} and ' \
            f'{len(embedding_sizes)} were given)'
        super().__init__()
        self.embedding_layers = ModuleList([Embedding(num_classes[i], embedding_sizes[i], padding_idx)
                                            for i in range(len(num_classes))])
        self.proj = Linear(sum(embedding_sizes), d_model)

    def forward(self, x) -> Tensor:
        """

        :param x: Tokens sequences, shape: (L, N, Z)
        :return: Embeddings, as a tensor with a shape (L, N, E)
        """
        embeds = []
        for i, mod in enumerate(self.embedding_layers):
            embeds.append(mod(x[:, :, i]))
        x = cat(embeds, dim=-1)  # (L, N, sum(embedding_sizes))
        return self.proj(x)  # (L, N, E)


class MultiOutput(Module):
    """Multi-output module.

    :param num_classes: number of classes for each token type
    :param d_model: size of the final embedding, i.e. dimension of the transformer
    """
    def __init__(self, num_classes: List[int], d_model: int):
        super().__init__()
        self.output_layers = ModuleList([Linear(d_model, num) for num in num_classes])

    def forward(self, x) -> List[Tensor]:
        """

        :param x: Tokens sequences, shape: (L, N, E)
        :return: List of tensors of shape (L, N, *)
        """
        return [out(x) for out in self.output_layers]  # (L, N, *)


class AbsolutePositionalEncoding(Module):
    """ Module injecting positional information in the embeddings of a sequence.
    To be used at the beginning of a transformer network, before the first layers.

    :param d_model: embedding size
    :param max_len: max length of the sequences that will be treated
    :param dropout: dropout value
    """

    def __init__(self, d_model: int, max_len: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x: Tensor) -> Tensor:
        """ Adds positional encoding to a sequence

        :param x: input tensor, shape (sequence length, batch size, embedding size)
        :return the tensor with positional encoding
        """

        x = x + self.pe[:x.size()[0], :].to(x.device, dtype=x.dtype)
        return self.dropout(x)


def top_k_sampling(x: Tensor, k: int, temperature: float = None) -> Tensor:
    """Top K sampling

    :param x: input tensor of shape (N,C) or (T,N,C)
    :param k: k factor
    :param temperature: temperature for softmax
    :return: sampling results as (N)
    """
    x_copy = x.clone() / temperature if temperature is not None else x.clone()
    indices_to_inf = x < torch.topk(x, k)[0][..., -1, None]
    x_copy[indices_to_inf] = float('-inf')
    if x.dim() == 2:  # (N,C)
        return torch.multinomial(torch.softmax(x_copy, -1), 1).squeeze(-1)
    elif x.dim() == 3:  # (T,N,C)
        return stack([torch.multinomial(torch.softmax(xi, -1), 1).squeeze(-1) for xi in x_copy])

And to create the model:

embedding_sizes = [256, 128, 128, 192, 64, 128, 128, 64]  # I just put random numbers, the choice is up to you
model = MyTransformer(num_layers=4, num_classes=[len(voc) for voc in tokenizer.vocab], d_model=256, nhead=8,
                      dim_feedforward=1024, max_seq_len=1024, embedding_sizes=embedding_sizes)  # nhead must divide d_model
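
As a quick, untested sanity check of the expected shapes, with dummy data:

# dummy batch of Octuple token indices, shape (N,T,Z): one random index per vocabulary
batch = torch.stack([torch.randint(0, len(voc), (4, 256)) for voc in tokenizer.vocab], dim=-1)

logits = model(batch.permute(1, 0, 2), causal=True)  # list of Z tensors, each of shape (T,N,C*)
print([l.shape for l in logits])

# autoregressive generation from a short primer; predict expects (N,T,Z)
# top_k is kept small so it stays below every vocabulary size
generated = model.predict(batch[:, :16], inference_lim=64, max_seq_len=1024, top_k=3)
print(generated.shape)  # (N, 16 + 64, Z)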

For your last question, by 443 and 580 do you mean the sums of the vocabulary sizes? And yes, the size can change between datasets: as the durations of the files differ, the length of the Bar vocabulary will also differ.

Let me try this week; hopefully within a few days I can give you the results.

Hi @envilk, thanks for your comment and for this bug report! I’ll look into it in the next few days to fix it.

My guess is that the decoded token is not of type TimeSignature ('3.6.8' looks like a Duration token). A check might solve it.
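
For illustration, the kind of check I have in mind (untested, using only the names visible in the traceback above):

if self.additional_tokens['TimeSignature']:
    last_value = self.tokens_to_events(tokens[0])[-1].value
    try:
        time_sig = self._parse_token_time_signature(last_value)
    except ValueError:  # the decoded token is not a valid TimeSignature token
        time_sig = TIME_SIGNATURE  # fall back to the default
else:  # default
    time_sig = TIME_SIGNATURE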

Also, for the Octuple, CP Word and MuMIDI tokenizations, I will soon push an update so that each tokenizer has several vocabularies, one for each token type. This will make it easier to create embedding layers of appropriate sizes, and will let a model return several sequences of logits with the associated sizes.
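
Concretely, the input / output layer sizes can then be derived directly from these vocabularies; a rough sketch (the layer sizes here are arbitrary):

from torch.nn import Embedding, Linear, ModuleList

d_model = 256  # arbitrary
# one embedding layer and one output projection per token type (Pitch, Velocity, Duration, ...)
embeddings = ModuleList([Embedding(len(voc), d_model) for voc in tokenizer.vocab])
to_logits = ModuleList([Linear(d_model, len(voc)) for voc in tokenizer.vocab])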

Nathan