openai-python: TTS streaming does not work

Confirm this is an issue with the Python library and not an underlying OpenAI API

  • This is an issue with the Python library

Describe the bug

When following the documentation for client.audio.speech.create(), the returned response has a stream_to_file(file_path) method whose docstring says it streams the content of the audio file as it is being created. This does not seem to work: I used a rather large text input that generates a 3.5-minute sound file, and the file is only created once the whole request has completed.

To Reproduce

Run the following code, replacing the text input with a decently large amount of text.

from pathlib import Path
from openai import OpenAI
client = OpenAI()

speech_file_path = Path(__file__).parent / "speech.mp3"
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="""
        <Decently large bit of text here>
    """
)

response.stream_to_file(speech_file_path)

Notice that when the script is run, the speech.mp3 file is only ever created after the request has fully completed.
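
One way to observe this is to watch the file size from a second terminal while the script runs, e.g. with a small throwaway polling script like this (not part of the repro itself):

import os
import sys
import time

# Poll the output file once per second so you can see whether bytes are
# flushed incrementally (streaming) or appear all at once at the end.
path = sys.argv[1] if len(sys.argv) > 1 else "speech.mp3"
while True:
    size = os.path.getsize(path) if os.path.exists(path) else 0
    print(f"{time.strftime('%H:%M:%S')}  {size} bytes")
    time.sleep(1)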

Code snippets

No response

OS

macOS

Python version

Python 3.11.6

Library version

openai v1.2.4

About this issue

  • Original URL
  • State: closed
  • Created 7 months ago
  • Reactions: 3
  • Comments: 37

Most upvoted comments

Once it is fixed, would this allow us to send the streamed response from chat.completions.create to audio.speech.create and get lazy-reading behavior as the text response is streamed in chunks from the chat completion?

Nice @cyzanfar!

In case anyone comes across this thread looking for a fully working solution, written in Python, with a sample Flask app and an audio player:

from flask import Flask, Response, render_template_string
import requests

app = Flask(__name__)

@app.route('/')
def index():
    # HTML template to render an audio player
    html = '''
    <!DOCTYPE html>
    <html>
    <body>
    <audio controls autoplay>
        <source src="/stream" type="audio/mpeg">
        Your browser does not support the audio element.
    </audio>
    </body>
    </html>
    '''
    return render_template_string(html)

@app.route('/stream')
def stream():
    def generate():
        url = "https://api.openai.com/v1/audio/speech"
        headers = {
            "Authorization": 'Bearer YOUR_SK_TOKEN, 
        }

        data = {
            "model": "tts-1",
            "input": "YOUR TEXT THAT NEEDS TO BE TTSD HERE",
            "voice": "alloy",
            "response_format": "mp3",
        }

        with requests.post(url, headers=headers, json=data, stream=True) as response:
            if response.status_code == 200:
                for chunk in response.iter_content(chunk_size=4096):
                    yield chunk

    return Response(generate(), mimetype="audio/mpeg")

if __name__ == "__main__":
    app.run(debug=True, threaded=True)

And this is indeed working beautifully!

An example of how to stream the TTS response to your speakers with PyAudio is now available here: https://github.com/openai/openai-python/blob/e41abf7b7dbc1e744d167f748e55d4dedfc0dca7/examples/audio.py#L40-L60
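
For reference, the shape of that example is roughly the following (a minimal sketch rather than the exact linked code; it assumes response_format="pcm", which the endpoint returns as raw 24 kHz, 16-bit, mono audio, and that pyaudio is installed):

import pyaudio
from openai import OpenAI

client = OpenAI()

# PCM from the speech endpoint is raw 24 kHz, 16-bit, mono audio.
player = pyaudio.PyAudio().open(
    format=pyaudio.paInt16, channels=1, rate=24000, output=True
)

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Hello world! This is a streaming test.",
    response_format="pcm",
) as response:
    # Each chunk goes to the audio device as soon as it arrives.
    for chunk in response.iter_bytes(chunk_size=1024):
        player.write(chunk)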

Seems like everyone just wants to be able to pipe the incoming text streaming chunks into the speech create function and for it to start generating speech as soon as a certain threshold is reached (probably a whole sentence, so it sounds natural). Hope this feature is added soon.
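
In the meantime, this can be approximated client-side. A rough sketch (the sentence splitting is deliberately naive, the model name and file naming are placeholders, and it uses the with_streaming_response API that comes up later in this thread):

import re
from openai import OpenAI

client = OpenAI()

def spoken_sentences(prompt: str):
    """Yield complete sentences as the chat completion streams in."""
    buffer = ""
    stream = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        buffer += chunk.choices[0].delta.content or ""
        # Naive boundary: split after ., ! or ? followed by whitespace.
        *done, buffer = re.split(r"(?<=[.!?])\s+", buffer)
        yield from done
    if buffer.strip():
        yield buffer

# Synthesize each sentence as soon as it is complete, instead of waiting
# for the full chat response.
for i, sentence in enumerate(spoken_sentences("Tell me a short story.")):
    with client.audio.speech.with_streaming_response.create(
        model="tts-1", voice="alloy", input=sentence
    ) as response:
        response.stream_to_file(f"sentence_{i}.mp3")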

@antont out of curiosity, how did you play the streaming audio in the browser? Do you have example code you could share for the benefit of others?

Well, it’s very simple: just return audio/mpeg from the HTTP server and stream the response. Browsers handle that by showing an audio player, so I didn’t need to do anything on the browser side.

I used FastAPI, so the http handler func is:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI, AsyncStream
from openai.types.chat import ChatCompletionChunk

client = AsyncOpenAI()
app = FastAPI()

@app.get("/stream")
async def stream():
    text = "test. "
    text *= 10 #NOTE: I tested with 100 here too to make sure that it streams, i.e. still starts quickly, in about 1s
    
    #was without stream=True param
    #speech_stream: HttpxBinaryResponseContent = await text_to_speech_stream_openai(text)
    speech_stream: AsyncStream[ChatCompletionChunk] = await text_to_speech_stream_openai(text) #is not the actual type

    return StreamingResponse(speech_stream.response.aiter_bytes(), media_type="audio/mpeg")
    
async def text_to_speech_stream_openai(text: str):
    print('Generating audio from text review using Open AI API')
    #without stream=True is: response: HttpxBinaryResponseContent
    stream: AsyncStream[ChatCompletionChunk] = await client.audio.speech.create(
        model="tts-1",
        voice="echo",
        input=text,
        stream=True
    ) #type: ignore
    #print(type(response), dir(response), response)
    print(type(stream), dir(stream), stream)
    return stream

This requires the stream param in speech.create, so you need either my quick hack PR https://github.com/openai/openai-python/pull/724 or the later, more proper one from OpenAI, https://github.com/openai/openai-python/pull/866. I guess that one works too, even though they have reverted it now.

I have this live at https://id-longtask-3q4kbi7oda-uc.a.run.app/stream . The first query to the server takes time as it starts up the instance, but in later requests it starts playing audio in about 1 s. Don’t bomb it too much so I don’t need to take it down due to too much API usage…

I got the output of OAI TTS to stream. Here’s an example:

url = "https://api.openai.com/v1/audio/speech"
headers = {
    "Authorization": 'Bearer YOUR_API_KEY', 
}

data = {
    "model": model,
    "input": input_text,
    "voice": voice,
    "response_format": "opus",
}

with requests.post(url, headers=headers, json=data, stream=True) as response:
    if response.status_code == 200:
        buffer = io.BytesIO()
        # Chunks arrive as they are generated; here they are only collected,
        # but each one could be handed to a decoder/player instead.
        for chunk in response.iter_content(chunk_size=4096):
            buffer.write(chunk)

Hope that helps!

For those interested in sentence-based splitting, an openai-node user shared some code which might be helpful.

Cool, thanks. It doesn’t seem to use streaming to get the audio from tts, but maybe it’s fine if getting the chunks as whole is fast enough.

    const arrayBuffer = await response.arrayBuffer();
    const blob = new Blob([arrayBuffer], { type: 'audio/mpeg' });
    const url = URL.createObjectURL(blob);

@rattrayalex thank you Alex, that example works for me!

I spent some time trying to figure out why response_format="wav" gives a faster TTFB than pcm, and I found that only the WAV header is actually sent more quickly; the first audio data takes roughly the same time to arrive.

It might be nice to use the header to set the pyaudio options instead of hardcoding them, but I couldn’t figure out how to do that (didn’t manage to turn the response into something I can feed into wave.open()).
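
For what it’s worth, the canonical 44-byte WAV header can be parsed by hand with struct, so a sketch like the following could configure PyAudio from the stream itself (untested against the API; the offsets are the standard RIFF/WAVE layout, and it assumes the first 44 bytes of the response form a canonical header):

import struct
import pyaudio

def player_from_wav_header(header: bytes) -> "pyaudio.Stream":
    """Open a PyAudio output stream configured from a 44-byte WAV header."""
    assert header[:4] == b"RIFF" and header[8:12] == b"WAVE"
    # Channel count (uint16 at offset 22) and sample rate (uint32 at 24).
    channels, rate = struct.unpack_from("<HI", header, 22)
    # Bits per sample (uint16 at offset 34).
    bits_per_sample = struct.unpack_from("<H", header, 34)[0]
    p = pyaudio.PyAudio()
    return p.open(
        format=p.get_format_from_width(bits_per_sample // 8),
        channels=channels,
        rate=rate,
        output=True,
    )

You’d read the first 44 bytes off the response iterator, feed them to this, and write the remaining bytes to the returned stream. Note that for a streamed WAV the data-chunk length field is typically a placeholder, which may be why wave.open() refuses it.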

As of today (openai==1.12.0), the Python example on the text-to-speech quickstart yields:

DeprecationWarning: Due to a bug, this method doesn’t actually stream the response content, .with_streaming_response.method() should be used instead

I found the message a bit cryptic and couldn’t find any real documentation of this .with_streaming_response.method(), but after some digging I was able to get it working:

from openai import OpenAI

client = OpenAI()

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="""I see skies of blue and clouds of white
             The bright blessed days, the dark sacred nights
             And I think to myself
             What a wonderful world""",
) as response:
    # This doesn't seem to be *actually* streaming, it just creates the file
    # and then doesn't update it until the whole generation is finished
    response.stream_to_file("speech.mp3")

But I wasn’t able to achieve actual streaming with the Python library, only through the REST API (see my post on OpenAI Forum).

It would be great to have improved documentation and support for streaming TTS!

Hey @nimobeeren, thanks for the repro. On my computer this does seem to be streaming to the file as expected 😦. Would you be able to provide what system you’re using (it shouldn’t make a difference, but just in case) and how you were able to determine that the file wasn’t streaming?

Of course, thank you @RobertCraigie! I’m not sure how to handle an exception from OpenAI in this case, but this is a great start.

for reference:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
openai_client = OpenAI()

@app.get("/tts")  # route name is illustrative
def tts(input: str):
    def generate():
        with openai_client.audio.speech.with_streaming_response.create(
            model="tts-1",
            voice="alloy",
            input=input,
            response_format="mp3"
        ) as response:
            if response.status_code == 200:
                for chunk in response.iter_bytes(chunk_size=2048):
                    yield chunk

    return StreamingResponse(
        content=generate(),
        media_type="audio/mp3"
    )
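
On the exception question: the v1 client raises openai.APIStatusError for failed requests, so one option (a sketch only, reusing the same openai_client and input as above) is to catch it around the stream:

import openai

def generate():
    try:
        with openai_client.audio.speech.with_streaming_response.create(
            model="tts-1",
            voice="alloy",
            input=input,
            response_format="mp3"
        ) as response:
            for chunk in response.iter_bytes(chunk_size=2048):
                yield chunk
    except openai.APIStatusError as err:
        # By the time FastAPI iterates this generator, the HTTP headers may
        # already be sent, so the best we can do is log and end the stream.
        print(f"TTS request failed: {err}")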

Ah, I’m sorry to say the fix in #866 is being reverted; upon further discussion, we’ve found a better way that should be available in the coming days. Thank you for your patience.

Seems to be fixed in https://github.com/openai/openai-python/pull/866

I hacked it in earlier in https://github.com/openai/openai-python/pull/724 and the results were good when I just passed the stream=True parameter in… even for a long text (30 s or more), I started hearing it in a browser client in about 1 s.

@RobertCraigie seems to say here, though, that it only starts once the whole audio is completed on the server? Apparently the generation is quick, then?