MidiTok: Slow Performance of `tokenize_midi_dataset` Function
I have noticed a significant performance gap between two scripts I use to tokenize my dataset. The first script, which filters MIDI files and handles saving/loading manually, runs at approximately 300 iter/s. The second script, which uses the `tokenize_midi_dataset` function, runs roughly 15x slower (around 20 iter/s).

I think `tokenize_midi_dataset` doesn't take advantage of all available cores; a sharded workaround is sketched after the second script below.

Note that the filter MIDI script saves MIDI files while `tokenize_midi_dataset` saves JSON files, and that the filter MIDI script utilizes all available cores.
Filter MIDI Script
```python
import os
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

from jsonargparse import CLI
from miditok import REMI, TokenizerConfig
from symusic import Score
from tqdm.auto import tqdm


def process_midi(file_path, input_dir, output_dir):
    """
    Process a single MIDI file: tokenize it and save it to the output
    directory, maintaining the directory structure.

    :param file_path: Path to the MIDI file.
    :param input_dir: Base input directory.
    :param output_dir: Base output directory.
    """
    try:
        # Read the MIDI file
        midi_obj = Score.from_file(file_path)
        # Initialize the tokenizer (rebuilt for every file)
        tokenizer = REMI(
            TokenizerConfig(
                use_tempos=True,
                use_programs=True,
                use_time_signatures=True,
                use_chords=True,
                use_rests=True,
                one_token_stream_for_programs=True,
                special_tokens=["PAD", "BOS", "EOS"],
            )
        )
        # Tokenize to make sure the file is valid; the tokens themselves
        # are discarded (this script only filters and copies MIDI files)
        _ = tokenizer(midi_obj)
        # Construct the new path, mirroring the input directory structure
        relative_path = os.path.relpath(file_path, input_dir)
        new_path = os.path.join(output_dir, relative_path)
        # Create parent directories if they don't exist
        os.makedirs(os.path.dirname(new_path), exist_ok=True)
        # Save the MIDI file to the new location
        midi_obj.dump_midi(new_path)
    except Exception as e:
        print(f"Error processing {file_path}: {e}")


def main(input_dir: str, output_dir: str, max_workers: int | None = None):
    """
    Process all MIDI files in the input directory in parallel, and save them
    in the output directory.
    """
    # List all MIDI files in the directory tree
    midi_files = [str(p) for p in Path(input_dir).rglob("*.mid")]
    # Process each MIDI file in parallel
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # Map each future to its source file
        future_to_midi = {
            executor.submit(process_midi, midi_file, input_dir, output_dir): midi_file
            for midi_file in midi_files
        }
        # Iterate through the futures as they complete; tqdm tracks progress
        for future in tqdm(
            as_completed(future_to_midi),
            total=len(midi_files),
            desc="Processing MIDI files",
        ):
            file = future_to_midi[future]
            try:
                future.result()
            except Exception as exc:
                print(f"{file} generated an exception: {exc}")


if __name__ == "__main__":
    CLI(main, as_positional=False)
```
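For reference, I run this as `python filter_midi.py --input_dir <raw_dir> --output_dir <filtered_dir> --max_workers 16` (the script filename is arbitrary; jsonargparse derives the flags from the function's parameter names).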
Tokenize MIDI Script
```python
from dataclasses import dataclass
from pathlib import Path

from jsonargparse import CLI
from miditok import REMI, TokenizerConfig


@dataclass
class Config:
    data_dir: str
    output_dir: str


def cli_main(config: Config) -> None:
    # Initialize the tokenizer
    tokenizer = REMI(
        TokenizerConfig(
            use_tempos=True,
            use_programs=True,
            use_time_signatures=True,
            use_chords=True,
            use_rests=True,
            one_token_stream_for_programs=True,
            special_tokens=["PAD", "BOS", "EOS"],
        )
    )
    # Tokenize the whole dataset and save it as JSON files
    nobpe_output_dir = Path(f"{config.output_dir}/tokens_noBPE")
    midi_paths = list(Path(config.data_dir).glob("**/*.mid"))
    tokenizer.tokenize_midi_dataset(midi_paths, nobpe_output_dir)


if __name__ == "__main__":
    config = CLI(Config, as_positional=False)
    cli_main(config)
```
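In the meantime, a workaround that does use all cores is to shard the path list and call `tokenize_midi_dataset` once per worker process. A minimal sketch (`build_tokenizer`, `tokenize_shard`, and `tokenize_parallel` are my own helpers; the miditok calls are the same as above, and I'm assuming the per-file JSON outputs don't collide when all shards share one output directory):

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

from miditok import REMI, TokenizerConfig


def build_tokenizer() -> REMI:
    # Same REMI configuration as in the scripts above
    return REMI(
        TokenizerConfig(
            use_tempos=True,
            use_programs=True,
            use_time_signatures=True,
            use_chords=True,
            use_rests=True,
            one_token_stream_for_programs=True,
            special_tokens=["PAD", "BOS", "EOS"],
        )
    )


def tokenize_shard(midi_paths: list[Path], out_dir: Path) -> None:
    # Each worker process builds its own tokenizer and runs
    # tokenize_midi_dataset on its shard of the file list
    tokenizer = build_tokenizer()
    tokenizer.tokenize_midi_dataset(midi_paths, out_dir)


def tokenize_parallel(data_dir: str, out_dir: str, n_workers: int = 8) -> None:
    midi_paths = list(Path(data_dir).glob("**/*.mid"))
    # Round-robin split into one shard per worker; assumes no duplicate
    # file stems, since all shards write into the same output directory
    shards = [midi_paths[i::n_workers] for i in range(n_workers)]
    with ProcessPoolExecutor(max_workers=n_workers) as executor:
        futures = [
            executor.submit(tokenize_shard, shard, Path(out_dir))
            for shard in shards
            if shard
        ]
        for future in futures:
            future.result()  # re-raise any worker exception
```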
About this issue
- State: open
- Created 4 months ago
- Comments: 31 (18 by maintainers)
Interesting. If I get some time I’ll test this and share the results with you.
That will help reduce complexity a lot. But I assume that for BPE you will want to pre-train the tokenizer first.
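If I read the BPE API correctly, that pre-training step could look like this, continuing from the tokenize script above (`learn_bpe` and `apply_bpe_to_dataset` are the calls I believe miditok exposes for this; the vocabulary size and output directory are placeholders):

```python
from pathlib import Path

# Learn BPE from the no-BPE JSON token files, then re-save the dataset
# with BPE applied; `tokenizer` and `config` come from the script above
nobpe_dir = Path(config.output_dir) / "tokens_noBPE"
bpe_dir = Path(config.output_dir) / "tokens_BPE"

tokenizer.learn_bpe(
    vocab_size=10_000,  # placeholder target vocabulary size
    tokens_paths=list(nobpe_dir.glob("**/*.json")),
)
tokenizer.apply_bpe_to_dataset(nobpe_dir, bpe_dir)
```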
Do you have an idea of how one might extract the time, say, for the start and the end of a segment? I do think this would be quite possible when tokenizing on the fly, since we have access to the MIDI file itself.
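For instance, since we have the `Score` in memory when tokenizing on the fly, a hypothetical helper along these lines could recover a segment's tick span from its notes (`segment_tick_range` is my own name; I'm assuming symusic `Note` objects expose `time` and `duration` in ticks):

```python
from symusic import Score


def segment_tick_range(score: Score) -> tuple[int, int]:
    # Hypothetical helper: the segment's start is the earliest note onset,
    # its end the latest note offset, across all tracks
    onsets = [note.time for track in score.tracks for note in track.notes]
    offsets = [note.time + note.duration for track in score.tracks for note in track.notes]
    return min(onsets), max(offsets)
```

Converting those ticks to seconds would additionally need the tempo map, which the `Score` also carries.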
BTW, awesome job with MidiTok; it has been quite instrumental in my work.