MidiTok: When using REMI for tokenization with use_time_signatures=True, many duplicate measures can be encoded.
miditok version
2.1.5
Problem summary
When tokenizing with REMI and use_time_signatures=True, the output can contain long runs of repeated 'Bar_None', 'TimeSig_4/4' token pairs, i.e. many duplicate measures are encoded.
Steps to reproduce
from miditok import REMI, REMIPlus, TokenizerConfig, MIDITokenizer

config = TokenizerConfig(
    use_tempos=True,
    nb_tempos=240,
    tempo_range=(20, 259),
    use_rests=True,
    use_time_signatures=True,
    time_signature_range={8: (3, 12), 4: (1, 6)},
    use_programs=False,
)
tokenizer = REMI(config)
tokens = tokenizer(midi_path)  # midi_path: the MIDI file being tokenized
print("MIDI tokens: \n", tokens)
Expected vs. actual behavior
MIDI tokens: [TokSequence(tokens=[[…, 'Velocity_79', 'Duration_1.0.8', 'Bar_None', 'TimeSig_4/4', 'Bar_None', 'TimeSig_4/4', 'Bar_None', 'TimeSig_4/4', 'Bar_None', 'TimeSig_4/4', 'Bar_None', 'TimeSig_4/4', 'Bar_None', 'TimeSig_4/4', 'Bar_None', 'TimeSig_4/4', 'Position_0', 'Position_3', 'Pitch_37', 'Velocity_79', 'Duration_1.0.8', 'Position_7', 'Pitch_66', 'Velocity_79', 'Duration_2.0.8', 'Position_11', …]
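To locate such runs in a longer token list, a small helper along these lines may be useful. It is only a sketch: it works on a flat list of token strings, the function name `find_empty_bar_runs` is my own and not part of MidiTok, and it treats a bar as "empty" when nothing but an optional `TimeSig_*` token sits between it and the next `Bar_*` token. Empty bars can be legitimate (long silences), so the result is a starting point for inspection rather than proof of a bug.

```python
def find_empty_bar_runs(tokens, min_run=2):
    """Return (start_index, run_length) for each run of consecutive empty bars.

    Hypothetical helper, not a MidiTok API. `tokens` is a flat list of strings.
    """
    runs = []
    run_start, run_len = None, 0
    prev_bar = None  # index of the last Bar token with no content after it yet

    for idx, tok in enumerate(tokens):
        if tok.startswith("Bar_"):
            if prev_bar is not None:
                # The previous bar closed without any content: it was empty.
                if run_len == 0:
                    run_start = prev_bar
                run_len += 1
            prev_bar = idx
        elif tok.startswith("TimeSig_"):
            # A time signature alone does not count as bar content.
            continue
        else:
            # Any other token is content: close the current run, if any.
            if run_len >= min_run:
                runs.append((run_start, run_len))
            run_start, run_len, prev_bar = None, 0, None

    if run_len >= min_run:
        runs.append((run_start, run_len))
    return runs


sample = [
    "Velocity_79", "Duration_1.0.8",
    "Bar_None", "TimeSig_4/4",
    "Bar_None", "TimeSig_4/4",
    "Bar_None", "TimeSig_4/4",
    "Position_0", "Pitch_37",
]
print(find_empty_bar_runs(sample))
```

Each tuple gives the index of the first empty bar in a run and the number of empty bars in that run; the bar that finally contains content is not counted.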
Debug
/usr/local/lib/python3.10/dist-packages/miditok/tokenizations/remi.py, line 101:
if event.time != previous_tick:
    # (Rest)
    if (
        self.config.use_rests
        and event.time - previous_note_end >= self._min_rest
    ):
        previous_tick = previous_note_end
        rest_values = self._ticks_to_duration_tokens(
            event.time - previous_tick, rest=True
        )
        for dur_value, dur_ticks in zip(*rest_values):
            all_events.append(
                Event(
                    type="Rest",
                    value=".".join(map(str, dur_value)),
                    time=previous_tick,
                    desc=f"{event.time - previous_tick} ticks",
                )
            )
            previous_tick += dur_ticks
        current_bar = previous_tick // ticks_per_bar
If previous_note_end > event.time, then event.time - previous_note_end is negative, the rest condition fails, and current_bar is not updated in this branch.
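The effect can be reproduced in isolation with made-up numbers. The sketch below mimics only the rest branch quoted above; the function name, the tick values, and the simplification that the emitted rests advance previous_tick exactly to event.time are all assumptions for illustration. Whatever the surrounding loop in remi.py does to current_bar afterwards is not modeled.

```python
def bar_after_rest_branch(event_time, previous_tick, previous_note_end,
                          current_bar, ticks_per_bar, min_rest, use_rests=True):
    """Return current_bar after a simplified version of the quoted rest branch.

    Hypothetical stand-in, not MidiTok code.
    """
    if event_time != previous_tick:
        if use_rests and event_time - previous_note_end >= min_rest:
            # Simplification: assume the Rest tokens cover the whole gap,
            # so previous_tick ends up at event_time.
            previous_tick = event_time
            current_bar = previous_tick // ticks_per_bar
    return current_bar


TICKS_PER_BAR = 4 * 480  # assumed: 4/4 at 480 ticks per quarter note
MIN_REST = 60            # assumed minimum rest length in ticks

# Case A: the last note ended well before the new event, so the branch runs.
refreshed = bar_after_rest_branch(
    event_time=4000, previous_tick=500, previous_note_end=1000,
    current_bar=0, ticks_per_bar=TICKS_PER_BAR, min_rest=MIN_REST,
)

# Case B: a sustained note ends after the new event, so the branch is skipped.
stale = bar_after_rest_branch(
    event_time=4000, previous_tick=500, previous_note_end=5000,
    current_bar=0, ticks_per_bar=TICKS_PER_BAR, min_rest=MIN_REST,
)

print("bar of the event:", 4000 // TICKS_PER_BAR)
print("case A current_bar:", refreshed)
print("case B current_bar:", stale)
```

In case B the event sits in bar 2, but current_bar coming out of this branch is still 0, which matches the behavior described above.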
I also found that the tokens produced by the tokenizer contain fewer measures than expected. For example, I have a MusicXML file with 210 measures, which I converted to MIDI using MuseScore; the conversion should have retained information such as tempo and time signature. However, after tokenizing with REMI, the tokens contain only 201 measures.
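For reference, a measure count like the one above can be obtained by counting `Bar_*` tokens. The snippet is a sketch under the assumption that the tokenizer emits exactly one `Bar_*` token per measure and that the tokens are available as a flat list of strings; the helper names are mine, not MidiTok's, and the token list is synthetic.

```python
def count_bars(tokens):
    """Count Bar tokens in a flat list of token strings (hypothetical helper)."""
    return sum(1 for tok in tokens if tok.startswith("Bar_"))


def missing_bars(tokens, expected_bars):
    """Return how many measures are missing relative to the expected count."""
    return expected_bars - count_bars(tokens)


# Synthetic stand-in for a tokenized file: 201 bars with one note each.
tokens = ["Bar_None", "Position_0", "Pitch_60"] * 201

print(count_bars(tokens))
print(missing_bars(tokens, expected_bars=210))
```

With 210 measures in the score and 201 `Bar_*` tokens, the difference is 9 measures.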
About this issue
- State: closed
- Created 9 months ago
- Comments: 24 (12 by maintainers)
This problem usually occurs in MIDI files with a large number of measures. When I tried to fix the bug by updating current_bar, the REMI tokenizer produced 107 measures, but the original MXL file has 124 measures. Attached file: 6354774_Macabre Waltz.zip