MidiTok: When using REMI for tokenization with use_time_signatures=True, many duplicate measures can be encoded.

miditok version

2.1.5

Problem summary

When tokenizing with REMI and use_time_signatures=True, the output can contain long runs of repeated 'Bar_None', 'TimeSig_4/4' pairs, i.e. many duplicate measures are encoded.

Steps to reproduce

from miditok import REMI, REMIPlus, TokenizerConfig, MIDITokenizer

config = TokenizerConfig(
    use_tempos=True,
    nb_tempos=240,
    tempo_range=(20, 259),
    use_rests=True,
    use_time_signatures=True,
    time_signature_range={8: (3, 12), 4: (1, 6)},
    use_programs=False,
)
tokenizer = REMI(config)
# midi_path is the path to the MIDI file being tokenized
tokens = tokenizer(midi_path)
print("MIDI tokens: \n", tokens)

Expected vs. actual behavior

MIDI tokens: [TokSequence(tokens=[[…, 'Velocity_79', 'Duration_1.0.8',
    'Bar_None', 'TimeSig_4/4', 'Bar_None', 'TimeSig_4/4', 'Bar_None', 'TimeSig_4/4',
    'Bar_None', 'TimeSig_4/4', 'Bar_None', 'TimeSig_4/4', 'Bar_None', 'TimeSig_4/4',
    'Bar_None', 'TimeSig_4/4', 'Position_0', 'Position_3', 'Pitch_37', 'Velocity_79',
    'Duration_1.0.8', 'Position_7', 'Pitch_66', 'Velocity_79', 'Duration_2.0.8',
    'Position_11', …]
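To locate such runs in a longer token list, a small helper along these lines may help. It is only a sketch: `empty_bar_runs` is a hypothetical name, not part of miditok, and it assumes the tokens are available as a flat list of strings shaped like the output above. A run of empty bars can also be legitimate silence in the piece, so the result is a hint rather than proof of a bug.

```python
def empty_bar_runs(tokens, min_run=2):
    """Return (start_index, run_length) for each run of consecutive empty bars.

    A bar counts as empty when its 'Bar_*' token is followed only by an
    optional 'TimeSig_*' token before the next 'Bar_*' token.
    (Hypothetical helper for inspection; not a miditok function.)
    """
    runs = []
    i, n = 0, len(tokens)
    while i < n:
        start, length = i, 0
        while i < n and tokens[i].startswith("Bar_"):
            j = i + 1
            if j < n and tokens[j].startswith("TimeSig_"):
                j += 1
            # The bar is empty only if another bar starts right after it.
            if j < n and tokens[j].startswith("Bar_"):
                length += 1
                i = j
            else:
                break
        if length >= min_run:
            runs.append((start, length))
        i += 1
    return runs


# Excerpt shaped like the output reported in this issue.
sample = (
    ["Velocity_79", "Duration_1.0.8"]
    + ["Bar_None", "TimeSig_4/4"] * 7
    + ["Position_0", "Position_3", "Pitch_37"]
)
runs = empty_bar_runs(sample)
print(runs)
```

On the excerpt, the seven consecutive Bar tokens are reported as one run of six empty bars starting at index 2; the seventh bar holds the Position and Pitch tokens.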

Debugging

/usr/local/lib/python3.10/dist-packages/miditok/tokenizations/remi.py, line 101:

            if event.time != previous_tick:
                # (Rest)
                if (
                    self.config.use_rests
                    and event.time - previous_note_end >= self._min_rest
                ):
                    previous_tick = previous_note_end
                    rest_values = self._ticks_to_duration_tokens(
                        event.time - previous_tick, rest=True
                    )
                    for dur_value, dur_ticks in zip(*rest_values):
                        all_events.append(
                            Event(
                                type="Rest",
                                value=".".join(map(str, dur_value)),
                                time=previous_tick,
                                desc=f"{event.time - previous_tick} ticks",
                            )
                        )
                        previous_tick += dur_ticks
                    current_bar = previous_tick // ticks_per_bar

If previous_note_end > event.time, the difference event.time - previous_note_end is negative, so the rest branch is skipped and current_bar is not updated there.
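The condition can be illustrated with a toy model of the branch quoted above. This is not the library's code: `bar_after_event` is a hypothetical function, the tick values are made up, and it assumes the emitted rests cover the whole gap. It only shows what this excerpt does to current_bar; whether code after the excerpt updates it again is not visible here.

```python
def bar_after_event(event_time, previous_note_end, current_bar,
                    min_rest, ticks_per_bar):
    """Toy model of the rest branch: return current_bar after the branch.

    Hypothetical simplification, not miditok code. It assumes the rest
    tokens cover the full gap, so previous_tick ends up at event_time.
    """
    if event_time - previous_note_end >= min_rest:
        previous_tick = event_time  # rests advanced previous_tick over the gap
        current_bar = previous_tick // ticks_per_bar
    # Otherwise the branch is skipped and current_bar keeps its old value.
    return current_bar


TICKS_PER_BAR = 1920  # assumed: 480 ticks per quarter note, 4/4 time
MIN_REST = 60         # assumed minimum rest length in ticks

# A gap before the event: the branch runs and current_bar jumps to bar 4.
with_gap = bar_after_event(7680, 3840, 1, MIN_REST, TICKS_PER_BAR)

# A long note still sounding past the event: the gap is negative,
# the branch is skipped, and current_bar stays at 1.
overlapping = bar_after_event(7680, 7800, 1, MIN_REST, TICKS_PER_BAR)

print(with_gap, overlapping)
```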

I also found that the tokens produced by the tokenizer contain fewer measures than expected. For example, I have a MusicXML file with 210 measures, which I converted to MIDI with MuseScore; the conversion should have preserved information such as tempo and time signature. After tokenizing with REMI, however, the tokens contain only 201 measures.
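One way to check a measure count like this is to count the Bar tokens and compare the result with a bar count derived from the time signature changes. The sketch below assumes a flat list of token strings and time signature changes that fall on bar boundaries; `count_bars` and `expected_bar_count` are hypothetical helpers, and the numbers in the example are invented rather than taken from the attached file.

```python
import math


def count_bars(tokens):
    """Count 'Bar_*' tokens in a flat list of token strings."""
    return sum(1 for tok in tokens if tok.startswith("Bar_"))


def expected_bar_count(time_sigs, end_tick, ticks_per_quarter):
    """Estimate the number of bars from time signature changes.

    time_sigs: list of (tick, numerator, denominator), sorted by tick,
    with the first entry at tick 0. Assumes every change lands on a
    bar boundary. A partially filled last bar is counted as a bar.
    """
    total = 0
    for idx, (tick, num, den) in enumerate(time_sigs):
        if idx + 1 < len(time_sigs):
            next_tick = time_sigs[idx + 1][0]
        else:
            next_tick = end_tick
        ticks_per_bar = ticks_per_quarter * 4 * num // den
        total += math.ceil((next_tick - tick) / ticks_per_bar)
    return total


# Invented example: four bars of 4/4, then two bars of 3/4, at 480 ticks per quarter.
expected = expected_bar_count([(0, 4, 4), (7680, 3, 4)], end_tick=10560,
                              ticks_per_quarter=480)
found = count_bars(["Bar_None", "TimeSig_4/4", "Position_0", "Pitch_60",
                    "Bar_None", "TimeSig_4/4", "Position_0", "Pitch_62"])
print(expected, found)
```

A mismatch between the two numbers narrows the problem down to the tokenizer's bar handling rather than the MusicXML to MIDI conversion, provided the time signature list is read from the MIDI file itself.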

About this issue

  • State: closed
  • Created 9 months ago
  • Comments: 24 (12 by maintainers)

Most upvoted comments

This problem usually occurs in MIDI files with a large number of measures. When I tried to fix the bug by updating current_bar, the REMI tokenizer produced 107 measures, but the original MXL file has 124. Attached file: 6354774_Macabre Waltz.zip