core: Wyoming integration returning incorrect URLs from piper

The problem

I have set up an Atom Echo as a voice assistant interface following the tutorial (https://www.home-assistant.io/projects/thirteen-usd-voice-remote/).

This works, but the generated responses are not played back. In the ESP logs, the Piper response URL sometimes ends in .raw and sometimes in .mp3. The response needs to be in WAV format, which I can get by manually changing the extension in the URL (see the log snippet below).
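
The manual workaround described here amounts to swapping the extension on the response URL. A minimal sketch in Python, assuming (as the report suggests, but not confirmed against the Home Assistant documentation) that the TTS proxy serves the same cache key under a .wav extension. The function name `force_wav` is hypothetical:

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def force_wav(url: str) -> str:
    """Return the URL with the extension of its path replaced by .wav."""
    parts = urlsplit(url)
    # splitext only strips the last suffix, so "tts.piper.raw" -> "tts.piper"
    stem, _ext = posixpath.splitext(parts.path)
    return urlunsplit(parts._replace(path=stem + ".wav"))


# URL taken from the log snippet in this report.
raw_url = (
    "http://ha.k8s.services.lan/api/tts_proxy/"
    "dae2cdcb27a1d1c3b07ba2c7db91480f9d4bfd8f_en-us_47f5ba5b18_tts.piper.raw"
)
print(force_wav(raw_url))
```

Only the final suffix is touched, so the hash, language, and engine parts of the file name stay as Home Assistant generated them.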

It is unclear whether this is related to https://github.com/home-assistant/core/issues/92528.

What version of Home Assistant Core has the issue?

core-2023.5.2

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant Container

Integration causing the issue

wyoming

Link to integration documentation on our website

No response

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

[00:11:41][D][voice_assistant:112]: Response: "Sorry, I couldn't understand that"
[00:11:41][D][voice_assistant:127]: Response URL: "http://ha.k8s.services.lan/api/tts_proxy/dae2cdcb27a1d1c3b07ba2c7db91480f9d4bfd8f_en-us_47f5ba5b18_tts.piper.raw"
[00:11:41][D][voice_assistant:132]: Assist Pipeline ended

Additional information

No response

About this issue

  • State: closed
  • Created a year ago
  • Comments: 31 (10 by maintainers)

Most upvoted comments

This was actually trivial to implement: https://github.com/esphome/ESP32-audioI2S/pull/12

You can test it with the following steps:

Copy the i2s_audio component

  • Go to your Home Assistant cli (VM, SSH, or whatever)
  • Type login to get a shell
  • Exec into the ESPHome container: docker exec -it addon_5c53de3b_esphome bash
    • Create the directory structure: mkdir -p /config/esphome/my_components/
    • Copy the i2s_audio component: cp -ar /esphome/esphome/components/i2s_audio /config/esphome/my_components/

Clone the patched ESP32-audioI2S library

  • Open a terminal in Home Assistant (VSCode addon, Terminal addon, doesn’t matter)
  • Create the directory structure: mkdir -p /config/esphome/my_libs/
  • Clone the lib: cd /config/esphome/my_libs/ && git clone https://github.com/robin-thoni/ESP32-audioI2S

Configure the new ESP32-audioI2S library

  • Open the VSCode or file browser addon. You should be able to see the esphome/my_components/i2s_audio and esphome/my_libs/ESP32-audioI2S folders
  • Open esphome/my_components/i2s_audio/media_player/__init__.py (NOT the __init__.py file at the root of the component, the one in the media_player folder)
  • At the end of the file, replace cg.add_library("esphome/ESP32-audioI2S", "2.0.7") with cg.add_library("file:///config/esphome/my_libs/ESP32-audioI2S", None)
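
The edit in the last step can also be scripted. The sketch below is only an illustration: the two strings come from the step above, while the helper name `patch_init_source` and the sample input are hypothetical. It fails loudly if the expected line is missing, which would likely mean the installed ESPHome version pins a different library version:

```python
OLD_LINE = 'cg.add_library("esphome/ESP32-audioI2S", "2.0.7")'
NEW_LINE = 'cg.add_library("file:///config/esphome/my_libs/ESP32-audioI2S", None)'


def patch_init_source(source: str) -> str:
    """Point the media_player component at the locally cloned library."""
    if OLD_LINE not in source:
        raise ValueError("expected add_library line not found")
    return source.replace(OLD_LINE, NEW_LINE)


# Stand-in for the end of media_player/__init__.py (illustrative only).
sample = "    " + OLD_LINE + "\n"
print(patch_init_source(sample))
```

To apply it to the real file, read esphome/my_components/i2s_audio/media_player/__init__.py, pass its text through the function, and write the result back.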

Rebuild

  • Add this to your device YAML:
    external_components:
      - source:
          type: local
          path: my_components
        components: [i2s_audio]
    
  • Clean the build files: in the ESPHome addon, on the home page (where all the devices are listed), click the three-dots menu for the device you want to patch, then click “Clean Build Files”
  • Install the device as usual

It should now play the .raw files generated by HA, without eating the end of the file. Here’s a quick demo: https://owncloud.rthoni.com/s/DfcrJXLZoRLpFpq

I can confirm, @grahambrown11: I have the same issue.

@synesthesiam, I tested the fix using the de version of HA and it is working fine with an ESPHome media player.

Thank you once again for the fix.

Oh please someone merge that PR!

Same problems here:

  • Media player
    • Won’t play raw files (that’s expected, after reading the sources)
  • Speaker
    • Eats the end of the audio
    • Can’t stream media to it
    • Can’t easily control volume

Maybe a solution/workaround would be to add raw format support to ESP32-audioI2S? This way, the media player would be able to play raw files, eliminating the need for the speaker component altogether.

I see a few references to CODEC_MP3 in https://github.com/esphome/ESP32-audioI2S/blob/07cb6eb71fbc47d45185270b5c84c762a126bbc3/src/Audio.cpp. Adding a new raw “codec” shouldn’t be that hard, since it’s exactly what the i2s function expects. I might give it a try soonish.
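
For context on why the extension matters: a .raw file is assumed here to be headerless PCM samples, whereas a WAV file carries a short header describing sample rate, channel count, and sample width. The following sketch wraps raw PCM in a WAV container using Python's standard wave module. The default format values (22050 Hz, mono, 16-bit) are assumptions for illustration and are not stated anywhere in this thread:

```python
import io
import wave


def wrap_pcm_as_wav(pcm: bytes, rate: int = 22050, channels: int = 1, width: int = 2) -> bytes:
    """Prepend a WAV header to headerless PCM data (format values are assumed)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(width)  # bytes per sample, 2 means 16-bit
        wav.setframerate(rate)
        wav.writeframes(pcm)
    return buf.getvalue()


# One second of 16-bit mono silence at the assumed sample rate.
silence = b"\x00\x00" * 22050
wav_bytes = wrap_pcm_as_wav(silence)
print(wav_bytes[:4], len(wav_bytes) - len(silence))
```

A decoder that relies on the header to detect the format has nothing to go on with a .raw payload, which is consistent with the media player refusing to play it.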

I’m also getting the .raw format when using either media_player or speaker in ESPHome. However, the speaker component does play back the audio, whereas the media player does not…

media_player log:

[10:36:24][D][voice_assistant:192]: Response: "Turned off light"
[10:36:24][D][voice_assistant:207]: Response URL: "http://192.168.1.9:8123/api/tts_proxy/db98156bf572727274889253f275cea21c83824c_en-gb_71e0aacf05_tts.piper.raw"
[10:36:24][D][media_player:059]: 'Media Player' - Setting
[10:36:24][D][media_player:066]:   Media URL: http://192.168.1.9:8123/api/tts_proxy/db98156bf572727274889253f275cea21c83824c_en-gb_71e0aacf05_tts.piper.raw

speaker log:

[10:49:05][D][voice_assistant:192]: Response: "Turned on light"
[10:49:05][D][voice_assistant:207]: Response URL: "http://192.168.1.9:8123/api/tts_proxy/c9423eae01959b2af87c0b8d21f861b36e9b0fec_en-gb_4a8f4e3e86_tts.piper.raw"
[10:49:05][D][voice_assistant:218]: Assist Pipeline ended

It’s still not working for me when I use a media_player for output, as the default voice assistant configuration does. A speaker works fine but the volume can’t be controlled.

And as noted in my comment above, two things: (1) no TTS comes out of the voice assistant, but I can send audio (including piper and other TTS!) directly from Home Assistant to the ESPHome media_player and it works; and (2) the ESP device thinks it’s playing audio, as the green pulsating LED continues pulsating for some period and then stops, as if it were speaking.

This is all using ESPHome 2023.7.0 and HA 2023.7.2.

Here are some logs from ESPHome 2023.7.0 on an Atom Echo:

[19:20:24][I][app:102]: ESPHome version 2023.7.0 compiled on Jul 19 2023, 19:13:04
[19:20:24][I][app:104]: Project m5stack.atom-echo version 1.0
[19:20:24][C][wifi:543]: WiFi:
[19:20:24][C][wifi:379]:   Local MAC: 64:B7:08:80:31:68
[19:20:24][C][wifi:380]:   SSID: [redacted]
[19:20:24][C][wifi:381]:   IP Address: 192.168.11.16
[19:20:24][C][wifi:383]:   BSSID: [redacted]
[19:20:24][C][wifi:384]:   Hostname: 'atomecho-voice-assist-1'
[19:20:24][C][wifi:386]:   Signal strength: -67 dB ▂▄▆█
[19:20:24][C][wifi:390]:   Channel: 1
[19:20:24][C][wifi:391]:   Subnet: 255.255.0.0
[19:20:24][C][wifi:392]:   Gateway: 192.168.17.1
[19:20:24][C][wifi:393]:   DNS1: 192.168.17.1
[19:20:24][C][wifi:394]:   DNS2: 0.0.0.0
[19:20:24][C][logger:301]: Logger:
[19:20:24][C][logger:302]:   Level: DEBUG
[19:20:24][C][logger:303]:   Log Baud Rate: 115200
[19:20:24][C][logger:305]:   Hardware UART: UART0
[19:20:24][C][esp32_rmt_led_strip:171]: ESP32 RMT LED Strip:
[19:20:24][C][esp32_rmt_led_strip:172]:   Pin: 27
[19:20:24][C][esp32_rmt_led_strip:173]:   Channel: 0
[19:20:24][C][esp32_rmt_led_strip:198]:   RGB Order: GRB
[19:20:24][C][esp32_rmt_led_strip:199]:   Max refresh rate: 0
[19:20:24][C][esp32_rmt_led_strip:200]:   Number of LEDs: 1
[19:20:25][C][gpio.binary_sensor:015]: GPIO Binary Sensor 'Button'
[19:20:25][C][gpio.binary_sensor:016]:   Pin: GPIO39
[19:20:25][C][light:103]: Light 'atomecho-voice-assist-1'
[19:20:25][C][light:105]:   Default Transition Length: 0.0s
[19:20:25][C][light:106]:   Gamma Correct: 2.80
[19:20:25][C][captive_portal:088]: Captive Portal:
[19:20:25][C][mdns:112]: mDNS:
[19:20:25][C][mdns:113]:   Hostname: atomecho-voice-assist-1
[19:20:25][C][ota:093]: Over-The-Air Updates:
[19:20:25][C][ota:094]:   Address: atomecho-voice-assist-1.local:3232
[19:20:25][C][api:138]: API Server:
[19:20:25][C][api:139]:   Address: atomecho-voice-assist-1.local:6053
[19:20:25][C][api:141]:   Using noise encryption: YES
[19:20:25][C][improv_serial:032]: Improv Serial:
[19:20:25][C][audio:203]: Audio:
[19:20:25][C][audio:225]:   External DAC channels: 1
[19:20:25][C][audio:226]:   I2S DOUT Pin: 22
[19:20:30][D][binary_sensor:036]: 'Button': Sending state ON
[19:20:31][D][voice_assistant:132]: Requesting start...
[19:20:31][D][voice_assistant:111]: Starting...
[19:20:32][D][voice_assistant:154]: Assist Pipeline running
[19:20:32][D][light:036]: 'atomecho-voice-assist-1' Setting:
[19:20:32][D][light:047]:   State: ON
[19:20:32][D][light:059]:   Red: 0%, Green: 0%, Blue: 100%
[19:20:33][D][binary_sensor:036]: 'Button': Sending state OFF
[19:20:33][D][voice_assistant:144]: Signaling stop...
[19:20:34][D][voice_assistant:168]: Speech recognised as: " Set the office desk lamp to white."
[19:20:34][D][voice_assistant:144]: Signaling stop...
[19:20:34][D][voice_assistant:192]: Response: "Color set"
[19:20:34][D][light:036]: 'atomecho-voice-assist-1' Setting:
[19:20:34][D][light:059]:   Red: 0%, Green: 100%, Blue: 0%
[19:20:34][D][voice_assistant:207]: Response URL: "http://192.168.17.10:8123/api/tts_proxy/9a9f96af14ebec28fdc2f47c5ae5cfa7b4e512a4_en-us_718a3c601f_tts.piper.mp3"
[19:20:34][D][media_player:059]: 'atomecho-voice-assist-1' - Setting
[19:20:34][D][media_player:066]:   Media URL: http://192.168.17.10:8123/api/tts_proxy/9a9f96af14ebec28fdc2f47c5ae5cfa7b4e512a4_en-us_718a3c601f_tts.piper.mp3
[19:20:34][D][light:036]: 'atomecho-voice-assist-1' Setting:
[19:20:34][D][light:059]:   Red: 0%, Green: 100%, Blue: 0%
[19:20:34][D][light:109]:   Effect: 'Pulse'
[19:20:34][W][component:204]: Component api took a long time for an operation (0.05 s).
[19:20:34][W][component:205]: Components should block for at most 20-30ms.
[19:20:35][W][component:204]: Component i2s_audio.media_player took a long time for an operation (0.55 s).
[19:20:35][W][component:205]: Components should block for at most 20-30ms.
[19:20:35][D][voice_assistant:218]: Assist Pipeline ended
[19:20:54][W][component:204]: Component i2s_audio.media_player took a long time for an operation (0.06 s).
[19:20:54][W][component:205]: Components should block for at most 20-30ms.
[19:20:55][W][component:204]: Component i2s_audio.media_player took a long time for an operation (0.46 s).
[19:20:55][W][component:205]: Components should block for at most 20-30ms.
[19:20:55][D][light:036]: 'atomecho-voice-assist-1' Setting:
[19:20:55][D][light:047]:   State: OFF
[19:20:55][D][light:109]:   Effect: 'None'
[19:21:07][D][binary_sensor:036]: 'Button': Sending state ON
[19:21:07][D][voice_assistant:132]: Requesting start...
[19:21:07][D][voice_assistant:111]: Starting...
[19:21:08][D][voice_assistant:154]: Assist Pipeline running
[19:21:08][D][light:036]: 'atomecho-voice-assist-1' Setting:
[19:21:08][D][light:047]:   State: ON
[19:21:08][D][light:059]:   Red: 0%, Green: 0%, Blue: 100%
[19:21:09][D][binary_sensor:036]: 'Button': Sending state OFF
[19:21:09][D][voice_assistant:144]: Signaling stop...
[19:21:10][D][voice_assistant:168]: Speech recognised as: " Set the office desk lamp to red."
[19:21:10][D][voice_assistant:144]: Signaling stop...
[19:21:10][D][voice_assistant:192]: Response: "Color set"
[19:21:10][D][light:036]: 'atomecho-voice-assist-1' Setting:
[19:21:10][D][light:059]:   Red: 0%, Green: 100%, Blue: 0%
[19:21:10][D][voice_assistant:207]: Response URL: "http://192.168.17.10:8123/api/tts_proxy/9a9f96af14ebec28fdc2f47c5ae5cfa7b4e512a4_en-us_718a3c601f_tts.piper.raw"
[19:21:10][D][media_player:059]: 'atomecho-voice-assist-1' - Setting
[19:21:10][D][media_player:066]:   Media URL: http://192.168.17.10:8123/api/tts_proxy/9a9f96af14ebec28fdc2f47c5ae5cfa7b4e512a4_en-us_718a3c601f_tts.piper.raw
[19:21:10][D][light:036]: 'atomecho-voice-assist-1' Setting:
[19:21:10][D][light:059]:   Red: 0%, Green: 100%, Blue: 0%
[19:21:10][D][light:109]:   Effect: 'Pulse'
[19:21:10][W][component:204]: Component api took a long time for an operation (0.05 s).
[19:21:10][W][component:205]: Components should block for at most 20-30ms.
[19:21:11][W][component:204]: Component i2s_audio.media_player took a long time for an operation (0.54 s).
[19:21:11][W][component:205]: Components should block for at most 20-30ms.
[19:21:11][D][voice_assistant:218]: Assist Pipeline ended
[19:21:11][W][component:204]: Component i2s_audio.media_player took a long time for an operation (0.46 s).
[19:21:11][W][component:205]: Components should block for at most 20-30ms.
[19:21:12][D][light:036]: 'atomecho-voice-assist-1' Setting:
[19:21:12][D][light:047]:   State: OFF
[19:21:12][D][light:109]:   Effect: 'None'
[19:21:28][D][binary_sensor:036]: 'Button': Sending state ON
[19:21:29][D][voice_assistant:132]: Requesting start...
[19:21:29][D][voice_assistant:111]: Starting...
[19:21:29][D][voice_assistant:154]: Assist Pipeline running
[19:21:29][D][light:036]: 'atomecho-voice-assist-1' Setting:
[19:21:29][D][light:047]:   State: ON
[19:21:29][D][light:059]:   Red: 0%, Green: 0%, Blue: 100%
[19:21:30][D][binary_sensor:036]: 'Button': Sending state OFF
[19:21:30][D][voice_assistant:144]: Signaling stop...
[19:21:31][D][voice_assistant:168]: Speech recognised as: " Turn off the office desk lamp."
[19:21:31][D][voice_assistant:144]: Signaling stop...
[19:21:31][D][voice_assistant:192]: Response: "Turned off light"
[19:21:31][D][light:036]: 'atomecho-voice-assist-1' Setting:
[19:21:31][D][light:059]:   Red: 0%, Green: 100%, Blue: 0%
[19:21:31][D][voice_assistant:207]: Response URL: "http://192.168.17.10:8123/api/tts_proxy/db98156bf572727274889253f275cea21c83824c_en-us_718a3c601f_tts.piper.mp3"
[19:21:31][D][media_player:059]: 'atomecho-voice-assist-1' - Setting
[19:21:31][D][media_player:066]:   Media URL: http://192.168.17.10:8123/api/tts_proxy/db98156bf572727274889253f275cea21c83824c_en-us_718a3c601f_tts.piper.mp3
[19:21:31][D][light:036]: 'atomecho-voice-assist-1' Setting:
[19:21:31][D][light:059]:   Red: 0%, Green: 100%, Blue: 0%
[19:21:31][D][light:109]:   Effect: 'Pulse'
[19:21:32][W][component:204]: Component i2s_audio.media_player took a long time for an operation (0.54 s).
[19:21:32][W][component:205]: Components should block for at most 20-30ms.
[19:21:32][D][voice_assistant:218]: Assist Pipeline ended
[19:21:39][W][component:204]: Component i2s_audio.media_player took a long time for an operation (0.06 s).
[19:21:39][W][component:205]: Components should block for at most 20-30ms.
[19:21:39][W][component:204]: Component i2s_audio.media_player took a long time for an operation (0.47 s).
[19:21:39][W][component:205]: Components should block for at most 20-30ms.
[19:21:39][D][light:036]: 'atomecho-voice-assist-1' Setting:
[19:21:39][D][light:047]:   State: OFF
[19:21:39][D][light:109]:   Effect: 'None'

Two additional notes:

  1. Manually playing back a Piper TTS (via the media entity) works fine.
  2. Playing back the audio via the assist debugger results in an error.

So I suspect the pipeline itself is somehow corrupting the response audio stream.
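
To check how often each extension appears in a captured ESPHome log, a small tally like the following may help. It is a sketch only: the regular expression assumes the `Response URL: "..."` line format seen in the logs above, and the function name `tally_extensions` is hypothetical:

```python
import re
from collections import Counter

# Matches the quoted URL on voice_assistant "Response URL" log lines.
URL_RE = re.compile(r'Response URL: "(?P<url>[^"]+)"')


def tally_extensions(log_text: str) -> Counter:
    """Count response URLs by their final file extension."""
    counts = Counter()
    for match in URL_RE.finditer(log_text):
        ext = match.group("url").rsplit(".", 1)[-1]
        counts[ext] += 1
    return counts


# Three lines copied from the Atom Echo log above.
log = '''\
[19:20:34][D][voice_assistant:207]: Response URL: "http://192.168.17.10:8123/api/tts_proxy/9a9f96af14ebec28fdc2f47c5ae5cfa7b4e512a4_en-us_718a3c601f_tts.piper.mp3"
[19:21:10][D][voice_assistant:207]: Response URL: "http://192.168.17.10:8123/api/tts_proxy/9a9f96af14ebec28fdc2f47c5ae5cfa7b4e512a4_en-us_718a3c601f_tts.piper.raw"
[19:21:31][D][voice_assistant:207]: Response URL: "http://192.168.17.10:8123/api/tts_proxy/db98156bf572727274889253f275cea21c83824c_en-us_718a3c601f_tts.piper.mp3"
'''
print(tally_extensions(log))
```

Note that the first two URLs share the same hash and differ only in extension, so the same cached response was offered as both .mp3 and .raw within a minute.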