MidiTok: Slow Performance of `tokenize_midi_dataset` Function
I have noticed a significant performance gap between two scripts I use to tokenize my dataset. The first script, which filters MIDI files and handles saving/loading manually, runs at approximately 300 iter/s. The second script, which uses the `tokenize_midi_dataset` function, runs roughly 15x slower (around 20 iter/s).

I think `tokenize_midi_dataset` doesn't take advantage of all available cores; a sharded workaround is sketched after the second script below.

Note that the filter MIDI script saves MIDI files while `tokenize_midi_dataset` saves JSON files, and that the filter MIDI script utilizes all available cores.
Filter MIDI Script
```python
import os
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

from jsonargparse import CLI
from miditok import REMI, TokenizerConfig
from symusic import Score
from tqdm.auto import tqdm


def process_midi(file_path, input_dir, output_dir):
    """
    Process a single MIDI file: tokenize it and save it to the output
    directory, maintaining the directory structure.

    :param file_path: Path to the MIDI file.
    :param input_dir: Base input directory.
    :param output_dir: Base output directory.
    """
    try:
        # Read the MIDI file
        midi_obj = Score.from_file(file_path)
        # Initialize the tokenizer (rebuilt for every file)
        tokenizer = REMI(
            TokenizerConfig(
                use_tempos=True,
                use_programs=True,
                use_time_signatures=True,
                use_chords=True,
                use_rests=True,
                one_token_stream_for_programs=True,
                special_tokens=["PAD", "BOS", "EOS"],
            )
        )
        # Tokenize to make sure the file is valid; the tokens themselves
        # are discarded (this script only filters and copies MIDI files)
        _ = tokenizer(midi_obj)
        # Construct the new path, mirroring the input directory structure
        relative_path = os.path.relpath(file_path, input_dir)
        new_path = os.path.join(output_dir, relative_path)
        # Create parent directories if they don't exist
        os.makedirs(os.path.dirname(new_path), exist_ok=True)
        # Save the MIDI file to the new location
        midi_obj.dump_midi(new_path)
    except Exception as e:
        print(f"Error processing {file_path}: {e}")


def main(input_dir: str, output_dir: str, max_workers: int | None = None):
    """
    Process all MIDI files in the input directory in parallel, and save them
    in the output directory.
    """
    # List all MIDI files in the directory tree
    midi_files = [str(p) for p in Path(input_dir).rglob("*.mid")]
    # Process each MIDI file in parallel
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # Map each future to its source file
        future_to_midi = {
            executor.submit(process_midi, midi_file, input_dir, output_dir): midi_file
            for midi_file in midi_files
        }
        # Iterate through the futures as they complete; tqdm tracks progress
        for future in tqdm(
            as_completed(future_to_midi),
            total=len(midi_files),
            desc="Processing MIDI files",
        ):
            file = future_to_midi[future]
            try:
                future.result()
            except Exception as exc:
                print(f"{file} generated an exception: {exc}")


if __name__ == "__main__":
    CLI(main, as_positional=False)
```
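For reference, I run this as `python filter_midi.py --input_dir <raw_dir> --output_dir <filtered_dir> --max_workers 16` (the script filename is arbitrary; jsonargparse derives the flags from the function's parameter names).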
Tokenize MIDI Script
```python
from dataclasses import dataclass
from pathlib import Path

from jsonargparse import CLI
from miditok import REMI, TokenizerConfig


@dataclass
class Config:
    data_dir: str
    output_dir: str


def cli_main(config: Config) -> None:
    # Initialize the tokenizer
    tokenizer = REMI(
        TokenizerConfig(
            use_tempos=True,
            use_programs=True,
            use_time_signatures=True,
            use_chords=True,
            use_rests=True,
            one_token_stream_for_programs=True,
            special_tokens=["PAD", "BOS", "EOS"],
        )
    )
    # Tokenize the whole dataset and save it as JSON files
    nobpe_output_dir = Path(f"{config.output_dir}/tokens_noBPE")
    midi_paths = list(Path(config.data_dir).glob("**/*.mid"))
    tokenizer.tokenize_midi_dataset(midi_paths, nobpe_output_dir)


if __name__ == "__main__":
    config = CLI(Config, as_positional=False)
    cli_main(config)
```
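In the meantime, a workaround that does use all cores is to shard the path list and call `tokenize_midi_dataset` once per worker process. A minimal sketch (`build_tokenizer`, `tokenize_shard`, and `tokenize_parallel` are my own helpers; the miditok calls are the same as above, and I'm assuming the per-file JSON outputs don't collide when all shards share one output directory):

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

from miditok import REMI, TokenizerConfig


def build_tokenizer() -> REMI:
    # Same REMI configuration as in the scripts above
    return REMI(
        TokenizerConfig(
            use_tempos=True,
            use_programs=True,
            use_time_signatures=True,
            use_chords=True,
            use_rests=True,
            one_token_stream_for_programs=True,
            special_tokens=["PAD", "BOS", "EOS"],
        )
    )


def tokenize_shard(midi_paths: list[Path], out_dir: Path) -> None:
    # Each worker process builds its own tokenizer and runs
    # tokenize_midi_dataset on its shard of the file list
    tokenizer = build_tokenizer()
    tokenizer.tokenize_midi_dataset(midi_paths, out_dir)


def tokenize_parallel(data_dir: str, out_dir: str, n_workers: int = 8) -> None:
    midi_paths = list(Path(data_dir).glob("**/*.mid"))
    # Round-robin split into one shard per worker; assumes no duplicate
    # file stems, since all shards write into the same output directory
    shards = [midi_paths[i::n_workers] for i in range(n_workers)]
    with ProcessPoolExecutor(max_workers=n_workers) as executor:
        futures = [
            executor.submit(tokenize_shard, shard, Path(out_dir))
            for shard in shards
            if shard
        ]
        for future in futures:
            future.result()  # re-raise any worker exception
```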
About this issue
- State: open
- Created 4 months ago
- Comments: 31 (18 by maintainers)
Interesting. If I get some time I’ll test this and share the results with you.
That will help reduce complexity a lot. But I assume that for BPE you will want to pre-train the tokenizer first.
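If I read the BPE API correctly, that pre-training step could look like this, continuing from the tokenize script above (`learn_bpe` and `apply_bpe_to_dataset` are the calls I believe miditok exposes for this; the vocabulary size and output directory are placeholders):

```python
from pathlib import Path

# Learn BPE from the no-BPE JSON token files, then re-save the dataset
# with BPE applied; `tokenizer` and `config` come from the script above
nobpe_dir = Path(config.output_dir) / "tokens_noBPE"
bpe_dir = Path(config.output_dir) / "tokens_BPE"

tokenizer.learn_bpe(
    vocab_size=10_000,  # placeholder target vocabulary size
    tokens_paths=list(nobpe_dir.glob("**/*.json")),
)
tokenizer.apply_bpe_to_dataset(nobpe_dir, bpe_dir)
```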
Do you have an idea of how one might extract the time, say, for the start and the end of a segment? I do think this would be quite possible when tokenizing on the fly, since we have access to the MIDI file itself.
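For instance, since we have the `Score` in memory when tokenizing on the fly, a hypothetical helper along these lines could recover a segment's tick span from its notes (`segment_tick_range` is my own name; I'm assuming symusic `Note` objects expose `time` and `duration` in ticks):

```python
from symusic import Score


def segment_tick_range(score: Score) -> tuple[int, int]:
    # Hypothetical helper: the segment's start is the earliest note onset,
    # its end the latest note offset, across all tracks
    onsets = [note.time for track in score.tracks for note in track.notes]
    offsets = [note.time + note.duration for track in score.tracks for note in track.notes]
    return min(onsets), max(offsets)
```

Converting those ticks to seconds would additionally need the tempo map, which the `Score` also carries.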
BTW, awesome job with MidiTok; it has been quite instrumental in my work.