openai-python: TTS streaming does not work
Confirm this is an issue with the Python library and not an underlying OpenAI API
- This is an issue with the Python library
Describe the bug
When following the documentation for client.audio.speech.create(), the returned response has a stream_to_file(file_path) method that is documented to stream the audio content to the file as it is being generated. This does not seem to work. I used a rather large text input that generates a 3.5-minute sound file, and the file is only created once the whole request has completed.
To Reproduce
Run the following code, replacing the text input with a reasonably large amount of text.
from pathlib import Path

from openai import OpenAI

client = OpenAI()

speech_file_path = Path(__file__).parent / "speech.mp3"
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="""
<Decently large bit of text here>
""",
)
response.stream_to_file(speech_file_path)
Notice that when the script is run, the speech.mp3 file is only created after the request has fully completed.
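One way to check whether a response is being written to disk incrementally (rather than buffered and dumped at the end) is to poll the output file's size from another thread while the request runs. The sketch below is self-contained: a writer thread that emits chunks gradually stands in for the API call, and the file name and timings are illustrative only.

```python
import threading
import time
from pathlib import Path


def watch_growth(path: Path, interval: float = 0.05, timeout: float = 2.0) -> list[int]:
    """Poll a file's size and record every distinct value observed."""
    sizes: list[int] = []
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if path.exists():
            size = path.stat().st_size
            if not sizes or size != sizes[-1]:
                sizes.append(size)
        time.sleep(interval)
    return sizes


# Stand-in for the API call: a writer that emits chunks gradually.
path = Path("speech_demo.bin")


def writer() -> None:
    with path.open("wb") as f:
        for _ in range(5):
            f.write(b"\x00" * 1024)
            f.flush()
            time.sleep(0.2)


t = threading.Thread(target=writer)
t.start()
observed = watch_growth(path)
t.join()
path.unlink()

# A truly streamed file passes through several intermediate sizes;
# a fully buffered one jumps from nothing straight to the final size.
print(observed)
```

Pointing the same watcher at speech.mp3 while the script above runs would show whether stream_to_file is actually incremental.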
Code snippets
No response
OS
macOS
Python version
Python 3.11.6
Library version
openai v1.2.4
About this issue
- Original URL
- State: closed
- Created 7 months ago
- Reactions: 3
- Comments: 37
Once it is fixed, would this allow us to send the streamed response from chat.completions.create to audio.speech.create and get lazy-reading behavior as the text response is streamed in chunks from the chat completion?

Nice @cyzanfar!
In case anyone comes across this thread looking for a fully working solution, written in Python, with a sample Flask app and an audio player:
And this is indeed working beautifully!
An example on how to stream TTS response to your speakers with PyAudio is now available here: https://github.com/openai/openai-python/blob/e41abf7b7dbc1e744d167f748e55d4dedfc0dca7/examples/audio.py#L40-L60
Seems like everyone just wants to be able to pipe the incoming text streaming chunks into the speech create function and for it to start generating speech as soon as a certain threshold is reached (probably a whole sentence, so it sounds natural). Hope this feature is added soon.
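The buffering idea described above can be sketched in plain Python. This is not library API; the sentence-splitting regex and the chunk source are illustrative assumptions, and the real pipeline would feed each yielded sentence to a TTS call.

```python
import re


def sentences_from_chunks(chunks):
    """Accumulate streamed text chunks and yield complete sentences.

    A TTS request is only worth firing once a natural unit (a full
    sentence) has been buffered; any leftover text is flushed at the end.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Split on sentence-ending punctuation followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", buffer)
        buffer = parts.pop()  # the last piece may still be incomplete
        for sentence in parts:
            yield sentence
    if buffer.strip():
        yield buffer.strip()


# Simulated chat-completion deltas arriving in arbitrary pieces:
chunks = ["Hello wor", "ld. How are", " you? Fine", " thanks"]
print(list(sentences_from_chunks(chunks)))
# ['Hello world.', 'How are you?', 'Fine thanks']
```

Each yielded sentence could then be passed as the `input` of a separate speech.create call, so audio generation starts as soon as the first sentence is complete.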
Well, it’s very simple: just return audio/mpeg from the HTTP server and stream the response. Browsers handle that by showing an audio player, so I didn’t need to do anything on the browser side.
I used FastAPI, so the HTTP handler function is:
This requires the stream param in speech.create, so use either my quick-hack PR https://github.com/openai/openai-python/pull/724 or the later, more proper one from OpenAI, https://github.com/openai/openai-python/pull/866. I guess that one works too, even though they have reverted it now.
I have this live at https://id-longtask-3q4kbi7oda-uc.a.run.app/stream . The first query to the server takes time as it spins up the instance, but later requests start playing audio in about 1 second. Don’t hammer it too much so I don’t need to take it down due to excessive API usage…
I got the output of OAI TTS to stream. Here’s an example:
Hope that helps!
Cool, thanks. It doesn’t seem to use streaming to get the audio from TTS, but maybe that’s fine if getting each chunk as a whole is fast enough.
@rattrayalex thank you Alex, that example works for me!
I spent some time trying to figure out why response_format="wav" gives a faster TTFB than pcm, and I found that only the WAV header is actually sent more quickly; the first audio data takes roughly the same time to arrive.

It might be nice to use the header to set the PyAudio options instead of hardcoding them, but I couldn’t figure out how to do that (I didn’t manage to turn the response into something I could feed into wave.open()).

Hey @nimobeeren, thanks for the repro. On my computer this does seem to be streaming to the file as expected 😦. Would you be able to provide what system you’re using (it shouldn’t make a difference, but just in case) and how you were able to determine that the file wasn’t streaming?
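On the wave.open() point raised above: the stdlib wave module accepts any seekable file object, so buffering the WAV bytes into io.BytesIO first should work. A hedged sketch using synthetic data in place of the API response:

```python
import io
import wave


def wav_params(data: bytes) -> tuple[int, int, int]:
    """Read (channels, sample width, frame rate) from in-memory WAV bytes.

    wave.open() accepts any seekable file object, so wrapping the
    buffered bytes in io.BytesIO avoids writing a temporary file.
    """
    with wave.open(io.BytesIO(data), "rb") as wf:
        return wf.getnchannels(), wf.getsampwidth(), wf.getframerate()


# Synthetic one-second 16-bit mono 24 kHz WAV, standing in for the response body.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(24000)
    wf.writeframes(b"\x00\x00" * 24000)  # one second of silence

print(wav_params(buf.getvalue()))  # (1, 2, 24000)
```

Those three values are exactly what you would pass to PyAudio’s stream constructor instead of hardcoding them; the caveat is that with a streamed response you only have the header bytes at first, not the whole file.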
Of course, thank you @RobertCraigie! I’m not sure how to handle an exception from OpenAI in this case, but this is a great start.
for reference:
Ah, I’m sorry to say the fix in #866 is being reverted; upon further discussion, we’ve found a better way that should be available in the coming days. Thank you for your patience.
Seems to be fixed in https://github.com/openai/openai-python/pull/866
I hacked it in earlier in https://github.com/openai/openai-python/pull/724 and the results were good when I just passed the stream=True parameter in. Even for a long text (30 seconds or more of audio), I started hearing it in a browser client in about 1 second.
@RobertCraigie here seems to say, though, that it only starts when the whole audio is completed on the server? Apparently the generation is quick, then?