End-to-End ASR, Text Generation, and TTS

Upload or record audio. The pipeline transcribes the speech (ASR), generates a text response to the transcript, and reads that response aloud (TTS).
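
The three stages described above can be chained as plain function calls. The following is a minimal sketch only, not the demo's actual implementation: the `SpeechPipeline` class, the stage signatures, and the `fake_*` placeholder functions are all hypothetical and stand in for whatever ASR, text-generation, and TTS models are really used.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures (assumptions, not part of the original demo).
# Real ASR, text-generation, and TTS models would be wrapped to match them.
Transcriber = Callable[[bytes], str]   # audio bytes -> transcript
Responder = Callable[[str], str]       # transcript  -> text response
Synthesizer = Callable[[str], bytes]   # response    -> audio bytes


@dataclass
class SpeechPipeline:
    transcribe: Transcriber
    respond: Responder
    synthesize: Synthesizer

    def run(self, audio: bytes) -> tuple[str, str, bytes]:
        """Run ASR, then text generation, then TTS, returning every stage's output."""
        transcript = self.transcribe(audio)
        reply = self.respond(transcript)
        speech = self.synthesize(reply)
        return transcript, reply, speech


# Placeholder stages so the sketch runs without any model installed.
# They treat the "audio" as UTF-8 text, which a real system would not do.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_generate(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = SpeechPipeline(fake_asr, fake_generate, fake_tts)
transcript, reply, speech = pipeline.run(b"hello")
```

Returning the transcript and the text response alongside the synthesized audio lets an interface show each intermediate result, which makes it easier to see which stage is at fault when the spoken output is wrong.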