espnet: type of argument "text" must be one of (Dict[str, Tuple], torch.Tensor, numpy.ndarray); got str instead
I am trying to use the SingingGenerate class from the espnet2.bin.svs_inference module to generate a singing voice from some text. However, I am getting a TypeError with the message "type of argument "text" must be one of (Dict[str, Tuple], torch.Tensor, numpy.ndarray); got str instead". Here is my code:
import soundfile
from espnet2.bin.svs_inference import SingingGenerate
import torch
import numpy as np
svs = SingingGenerate(
    "/content/opencpop_svs_train_visinger2_raw_phn_None_zh_latest/exp/44k/svs_train_visinger2_raw_phn_None_zh/config.yaml",
    "/content/opencpop_svs_train_visinger2_raw_phn_None_zh_latest/exp/44k/svs_train_visinger2_raw_phn_None_zh/100epoch.pth",
)
text = 'hello world'
wav = svs(text)[0]
soundfile.write("out.wav", wav.numpy(), svs.fs, "PCM_16")
And here is the error message I am getting:
<ipython-input-26-ed572d25f986> in <cell line: 10>()
8
9
---> 10 wav = svs(text)[0]
11 soundfile.write("out.wav", wav.numpy(), svs.fs, "PCM_16")
/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py in decorate_context(*args, **kwargs)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
116
117 return decorate_context
/usr/local/lib/python3.10/dist-packages/espnet2/bin/svs_inference.py in __call__(self, text, singing, label, midi, duration_phn, duration_ruled_phn, duration_syb, phn_cnt, slur, pitch, energy, spembs, sids, lids, decode_conf)
136 decode_conf: Optional[Dict[str, Any]] = None,
137 ):
--> 138 assert check_argument_types()
139
140 # check inputs
/usr/local/lib/python3.10/dist-packages/typeguard/__init__.py in check_argument_types(memo)
873 check_type(description, value, expected_type, memo)
874 except TypeError as exc: # suppress unnecessarily long tracebacks
--> 875 raise TypeError(*exc.args) from None
876
877 return True
TypeError: type of argument "text" must be one of (Dict[str, Tuple], torch.Tensor, numpy.ndarray); got str instead
I have checked the documentation and it seems that passing a string as the text argument should work. Here is an example from the documentation:
>>> import soundfile
>>> svs = SingingGenerate("config.yml", "model.pth")
>>> wav = svs("Hello World")[0]
>>> soundfile.write("out.wav", wav.numpy(), svs.fs, "PCM_16")
If you check the input of any of the systems we provide, you will see that it does not take only text; it also needs a real song to provide the melody. From such input you can at least recover the melody and the text (additional alignment between melody and lyrics is still needed). But these systems are still not "text-to-song"/"lyric-to-song"; they are "text + melody -> song".
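To make that input shape concrete, here is a minimal sketch of a score-plus-lyrics call. The "score" key, the (tempo, notes) tuple, and the per-note (start, end, lyric, MIDI pitch) layout below are illustrative assumptions rather than the documented API; the authoritative format is whatever SingingGenerate.__call__ accepts as a Dict[str, Tuple] (see the svs_inference.py link later in this thread).

from espnet2.bin.svs_inference import SingingGenerate

svs = SingingGenerate("config.yaml", "model.pth")

# Hypothetical layout: a (tempo, notes) tuple, where each note bundles
# timing, the lyric, and a MIDI pitch -- i.e. melody plus text, not text alone.
score = {
    "score": (
        120,  # tempo in BPM (assumed)
        [
            # (start_sec, end_sec, lyric, midi_pitch) -- field order assumed
            (0.0, 0.5, "ni", 60),
            (0.5, 1.0, "hao", 62),
        ],
    ),
}
output = svs(score)  # a dict-shaped input passes the Dict[str, Tuple] type check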
Not possible (and I believe no other toolkit can do it either if you do not provide the related melody).
Opencpop is for Mandarin, while Ofuton is for Japanese. Apologies, but we do not have a good source for an English dataset yet. If you have any recommendations, please let us know; we would like to support more languages if possible.
I see. It is indeed problematic; I will fix that soon. Thanks for noting it.
For custom input, we have some instructions here https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/svs1#1-database-dependent-data-preparation, but you may also refer to our existing recipes on public datasets.
For using the framework, I would recommend going through a recipe to get to know every stage, including data preparation, training, inference, and (possibly) evaluation.
The toolkit itself is OK for commercial use (we are under the Apache license). However, the model and data might not be commercially granted, given the license of the dataset. I would recommend checking the detailed license of the dataset or contacting the creator for more guidance. As for us, we probably cannot offer any help regarding that.
We usually first test our models on Opencpop (https://github.com/espnet/espnet/tree/master/egs2/opencpop/svs1) and Ofuton (https://github.com/espnet/espnet/tree/master/egs2/ofuton_p_utagoe_db/svs1), two well-constructed datasets. The links I provided include the recipes to train the models.
We also have a general documentation maintained at https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/svs1#readme
https://espnet.github.io/espnet/_gen/espnet2.bin.html#espnet2-bin-svs-inference-1
Umm, I do not think we have documentation saying that text alone would be OK for this scenario. Could you kindly let me know where you found that doc?
Hi, thanks for your interest in the toolkit. I think the documentation is indeed wrong, as it was inherited from the TTS docstring. We will be fixing that soon!
For singing synthesis, we expect a music score and the corresponding lyrics as input. The correct inference procedure is at: https://github.com/espnet/espnet/blob/master/espnet2/bin/svs_inference.py#L419 (where it takes input as indicated in the function at https://github.com/espnet/espnet/blob/master/espnet2/svs/espnet_model.py#L451)
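Continuing the sketch earlier in this thread: the keyword name below comes from the __call__ signature shown in the traceback above, while the example value for sids is a placeholder, not documented usage.

import numpy as np

# Optional conditioning inputs exist alongside the score dict; the keyword
# name is from the traceback's signature, the value is illustrative only.
output = svs(
    score,               # the Dict[str, Tuple] score/lyrics input sketched above
    sids=np.array([0]),  # optional speaker id (assumed usage)
)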