espnet: type of argument "text" must be one of (Dict[str, Tuple], torch.Tensor, numpy.ndarray); got str instead

I am trying to use the SingingGenerate class from the espnet2.bin.svs_inference module to generate a singing voice for some text. However, I am getting a TypeError with the message “type of argument "text" must be one of (Dict[str, Tuple], torch.Tensor, numpy.ndarray); got str instead”. Here is my code:

import soundfile
from espnet2.bin.svs_inference import SingingGenerate
import torch
import numpy as np
svs = SingingGenerate("/content/opencpop_svs_train_visinger2_raw_phn_None_zh_latest/exp/44k/svs_train_visinger2_raw_phn_None_zh/config.yaml", "/content/opencpop_svs_train_visinger2_raw_phn_None_zh_latest/exp/44k/svs_train_visinger2_raw_phn_None_zh/100epoch.pth")
text = 'hello world'


wav = svs(text)[0]
soundfile.write("out.wav", wav.numpy(), svs.fs, "PCM_16")

And here is the error message I am getting:

<ipython-input-26-ed572d25f986> in <cell line: 10>()
      8 
      9 
---> 10 wav = svs(text)[0]
     11 soundfile.write("out.wav", wav.numpy(), svs.fs, "PCM_16")

2 frames
/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py in decorate_context(*args, **kwargs)
    113     def decorate_context(*args, **kwargs):
    114         with ctx_factory():
--> 115             return func(*args, **kwargs)
    116 
    117     return decorate_context

/usr/local/lib/python3.10/dist-packages/espnet2/bin/svs_inference.py in __call__(self, text, singing, label, midi, duration_phn, duration_ruled_phn, duration_syb, phn_cnt, slur, pitch, energy, spembs, sids, lids, decode_conf)
    136         decode_conf: Optional[Dict[str, Any]] = None,
    137     ):
--> 138         assert check_argument_types()
    139 
    140         # check inputs

/usr/local/lib/python3.10/dist-packages/typeguard/__init__.py in check_argument_types(memo)
    873                 check_type(description, value, expected_type, memo)
    874             except TypeError as exc:  # suppress unnecessarily long tracebacks
--> 875                 raise TypeError(*exc.args) from None
    876 
    877     return True

TypeError: type of argument "text" must be one of (Dict[str, Tuple], torch.Tensor, numpy.ndarray); got str instead

I have checked the documentation and it seems that passing a string as the text argument should work. Here is an example from the documentation:

    >>> import soundfile
    >>> svs = SingingGenerate("config.yml", "model.pth")
    >>> wav = svs("Hello World")[0]
    >>> soundfile.write("out.wav", wav.numpy(), svs.fs, "PCM_16")

Most upvoted comments

If you check the input of any of the systems you provide, they do not take only text as input, but also a real song to provide the melody. In that case, you can at least recover the melody and the text (additional alignment between melody and lyrics is needed). But still, they are not “text-to-song / lyrics-to-song”; they are “text + melody -> song”.

Not possible (and I believe no other toolkit can do it either, if you do not provide the related melody).

Opencpop is for Mandarin, while Ofuton is for Japanese. Apologies, but we do not have a good source for an English dataset yet. If you have any recommendations, please let us know; we would like to support more languages if possible.

Umm, I do not think we have documentation saying that text alone would be OK for this scenario. Could you kindly let me know where you found the doc for that?

https://espnet.github.io/espnet/_gen/espnet2.bin.html#espnet2-bin-svs-inference-1

I see. It is indeed problematic; I will fix it soon. Thanks for noting that.

Creating Custom input: I’m intrigued by the possibility of creating my own input for singing synthesis. Could you provide some pointers or resources that would help me get started with creating custom inputs? It would be wonderful to learn more about this aspect of the toolkit.

Best Practices and Models: In your experience, what would be the best approach or model for creating music from lyrics using the toolkit? Are there any specific procedures or techniques that you would recommend for someone starting out? Any insights you can provide would be invaluable.

For custom input, we have some instructions here: https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/svs1#1-database-dependent-data-preparation, but you may also refer to our existing recipes on public datasets.
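
To make "custom input" concrete: at inference time the model ultimately needs a tempo plus a list of timed notes, each carrying its lyric, MIDI pitch, and phoneme. Below is a minimal sketch of building such a score by hand; build_score is a hypothetical helper, and the (start_sec, end_sec, lyric, midi_pitch, phoneme) tuple layout is an assumption based on the Opencpop recipe, so double-check it against the recipe's data files.

def build_score(tempo_bpm, melody):
    # melody: list of (lyric, phoneme, midi_pitch, n_beats) entries
    sec_per_beat = 60.0 / tempo_bpm
    notes, t = [], 0.0
    for lyric, phoneme, midi_pitch, n_beats in melody:
        dur = n_beats * sec_per_beat
        # assumed note layout: (start_sec, end_sec, lyric, midi_pitch, phoneme)
        notes.append((t, t + dur, lyric, midi_pitch, phoneme))
        t += dur
    return (tempo_bpm, notes)

score = build_score(
    tempo_bpm=90,
    melody=[
        ("你", "n_i", 60, 1.0),   # one beat on middle C -- toy values
        ("好", "h_ao", 62, 1.0),  # one beat on D
    ],
)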

For using the framework, I would recommend going through a recipe to get to know every stage, including data preparation, training, inference, and (possibly) evaluation.

In reference to your previous message:

Commercial Use: I’m seeking clarification on whether the toolkit and its singing synthesis functionality are suitable for commercial use. If you could provide insights into any licensing considerations related to commercial applications, that would be extremely valuable.

Music Score and Lyrics: Your explanation about the necessity of music scores and corresponding lyrics for synthesis was enlightening. Could you recommend any preferred sources or datasets where I can access these music scores and lyrics for my experimentation? If there’s a particular dataset that the toolkit developers suggest, I’d be grateful for the information.

The toolkit itself is OK for commercial use (we are under the Apache license). However, the models and data might not be granted for commercial use, depending on the license of the dataset. I would recommend you check the detailed license of the dataset or contact its creator for more guidance. As for us, we probably cannot offer any help regarding that.

We usually first test our models on Opencpop (https://github.com/espnet/espnet/tree/master/egs2/opencpop/svs1) and Ofuton (https://github.com/espnet/espnet/tree/master/egs2/ofuton_p_utagoe_db/svs1), two well-constructed datasets. The links I provide include the recipes to train the models.

We also have general documentation maintained at https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/svs1#readme

Hi, thanks for your interest in the toolkit. I think the documentation is indeed wrong, as it was inherited from the TTS documentation. We will be fixing that soon!

For singing synthesis, we expect a music score and the corresponding lyrics as input. The correct inference procedure is at https://github.com/espnet/espnet/blob/master/espnet2/bin/svs_inference.py#L419 (it takes input as indicated in the function at https://github.com/espnet/espnet/blob/master/espnet2/svs/espnet_model.py#L451).
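
For reference, calling it from Python then looks roughly like the sketch below, reusing the Opencpop VISinger2 checkpoint paths from the original post. The "score" key, the note tuple layout, and the "wav" output key are assumptions drawn from the ESPnet SVS demo usage, so verify them against the svs_inference.py linked above.

import soundfile
from espnet2.bin.svs_inference import SingingGenerate

svs = SingingGenerate(
    "/content/opencpop_svs_train_visinger2_raw_phn_None_zh_latest/exp/44k/svs_train_visinger2_raw_phn_None_zh/config.yaml",
    "/content/opencpop_svs_train_visinger2_raw_phn_None_zh_latest/exp/44k/svs_train_visinger2_raw_phn_None_zh/100epoch.pth",
)

# The first argument is a dict, not a plain string: a music score with a
# tempo and timed notes -- (start_sec, end_sec, lyric, midi_pitch, phoneme).
batch = {
    "score": (
        90,  # tempo in BPM -- toy value
        [
            (0.000, 0.667, "你", 60, "n_i"),   # toy note on middle C
            (0.667, 1.333, "好", 62, "h_ao"),  # toy note on D
        ],
    ),
}

output_dict = svs(batch)  # assumed to return a dict containing a "wav" tensor
soundfile.write("out.wav", output_dict["wav"].numpy(), svs.fs, "PCM_16")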