Voice Chat with Local LLMs: Whisper + TTS
Talking to your local LLM instead of typing is one of those things that sounds like a gimmick until you try it. Once you can just speak a question and hear the answer back, it changes how you interact with local AI entirely.
The pipeline is simpler than you’d think: Whisper listens to you, your LLM thinks, and a TTS engine reads the response aloud. Three pieces, all running locally, no cloud required.
Here’s how to set it up.
How the Voice Pipeline Works
Every local voice assistant follows the same three-step chain:
- STT (Speech-to-Text): Microphone audio → Whisper → text transcript
- LLM (Language Model): Text prompt → Ollama/llama.cpp → text response
- TTS (Text-to-Speech): Text response → TTS engine → speaker audio
The total latency is the sum of all three. On a decent GPU, you’re looking at:
| Stage | Typical Latency | What Drives It |
|---|---|---|
| STT (Whisper turbo) | 100-300 ms | Audio length, model size, GPU speed |
| LLM (time to first token) | 200-500 ms | Model size, context length, GPU speed |
| TTS (first audio chunk) | 100-300 ms | TTS engine, voice quality, GPU/CPU |
| Total round-trip | 400-1100 ms | Everything above combined |
That’s roughly 0.4-1.1 seconds from when you stop talking to when you start hearing the answer. Not quite real-time conversation, but fast enough to feel natural.
Step 1: Speech-to-Text with Whisper
OpenAI’s Whisper is the de facto standard for local speech recognition. It’s open-source, runs on consumer hardware, and is genuinely good: it’s accurate across accents, handles background noise, and supports 99 languages.
Whisper Model Sizes
| Model | Parameters | VRAM | Relative Speed | Accuracy (WER) |
|---|---|---|---|---|
| tiny | 39M | ~1 GB | 32x | Rough; fine for commands |
| base | 74M | ~1 GB | 16x | Decent for clear speech |
| small | 244M | ~2 GB | 6x | Good general use |
| medium | 769M | ~5 GB | 2x | Great accuracy |
| large-v3 | 1.55B | ~10 GB | 1x | Best accuracy |
| turbo | 809M | ~6 GB | 6-8x | Near large-v3 quality |
The one to use: Whisper turbo. It’s 6-8x faster than large-v3 with minimal accuracy loss. Unless you’re transcribing heavily accented speech in a noisy room, turbo is the sweet spot.
Which Whisper Implementation?
You have three main options:
faster-whisper (recommended for GPU users): Uses CTranslate2 under the hood. 2-4x faster than OpenAI’s original code, uses 40-60% less VRAM. This is what most voice pipeline tools use internally.
```
pip install faster-whisper
```

```python
from faster_whisper import WhisperModel

model = WhisperModel("turbo", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(segment.text)
```
whisper.cpp (recommended for CPU-only): C/C++ port that runs well without a GPU. Great for laptops and low-power setups. Download GGML model files and run from the command line:
```
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make
./main -m models/ggml-base.en.bin -f audio.wav
```
Original OpenAI Whisper (the baseline): Works but is the slowest option. Only use this if you need a specific feature the others don’t support.
```
pip install openai-whisper
whisper audio.wav --model turbo
```
Real-Time Streaming
For voice chat, you don’t want to record a whole sentence and then transcribe it. You want something closer to streaming: transcribing as you speak. faster-whisper doesn’t stream natively, but its built-in voice activity detection (VAD) gets you most of the way by detecting speech segments, so each utterance can be transcribed the moment it ends:
```python
from faster_whisper import WhisperModel

model = WhisperModel("turbo", device="cuda")
# Use Silero VAD to detect speech segments automatically
segments, _ = model.transcribe("audio.wav", vad_filter=True)
```
For a microphone-to-text pipeline, pair faster-whisper with PyAudio or sounddevice to capture audio chunks, and feed them to the model as they come in.
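One wrinkle when wiring that up: PyAudio-style capture typically hands you raw little-endian 16-bit PCM bytes, while faster-whisper’s array input wants float32 samples in [-1, 1]. A stdlib-only sketch of that conversion (sounddevice can record float32 directly, so this mainly matters for PyAudio-style capture; the function name is my own):

```python
import struct

def pcm16_to_float(pcm: bytes) -> list[float]:
    """Convert little-endian 16-bit PCM bytes to floats in [-1.0, 1.0),
    ready to wrap in a numpy float32 array for model.transcribe()."""
    count = len(pcm) // 2
    samples = struct.unpack("<%dh" % count, pcm[: count * 2])
    return [s / 32768.0 for s in samples]

# Silence, full-scale positive, full-scale negative samples:
print(pcm16_to_float(b"\x00\x00\xff\x7f\x00\x80"))
```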
Step 2: Text-to-Speech Options
TTS is where the local ecosystem has exploded recently. You have options ranging from “sounds like a GPS” to “indistinguishable from a real person.”
TTS Comparison
| Engine | Size | Runs On | Quality | Latency | Voice Cloning |
|---|---|---|---|---|---|
| Kokoro | 82M params | CPU or GPU | Excellent; #1 on TTS Arena | Sub-300 ms | No (preset voices) |
| Piper | 15-20M params | CPU only | Good; clear and natural | Sub-100 ms | No (trained voices) |
| Chatterbox | ~300M params | GPU (2-4 GB) | Excellent; beats ElevenLabs in blind tests | Sub-200 ms | Yes (10s reference clip) |
| Coqui XTTS | ~1.5B params | GPU (~2 GB) | Very good, multilingual | 300-500 ms | Yes (6s reference clip) |
| Bark | ~1B params | GPU (~12 GB) | Expressive, can laugh/sing | 2-5 seconds | Limited |
| edge-tts | Cloud | Internet required | Very good (Microsoft voices) | 100-300 ms | No |
Kokoro (Best Overall)
Kokoro is an 82M-parameter model that hit #1 on the TTS Arena leaderboard, beating commercial services. It’s fast, sounds natural, and runs on CPU or GPU.
```
pip install "kokoro>=0.9"
```

```python
from kokoro import KPipeline

pipeline = KPipeline(lang_code="a")  # American English
generator = pipeline("Hello! I'm your local AI assistant.", voice="af_heart")
for i, (gs, ps, audio) in enumerate(generator):
    # audio is a numpy array, play or save it
    pass
```
Kokoro has multiple preset voices. No voice cloning, but the built-in voices sound genuinely good.
Piper (Best for CPU-Only)
If you don’t have a GPU, or your GPU is fully occupied by the LLM, Piper is the answer. It’s a neural TTS engine that runs entirely on CPU with near-instant output:
```
echo "Hello from your local assistant" | \
  piper --model en_US-lessac-medium --output_file response.wav
```
Piper has dozens of pre-trained voices across many languages. Quality is a step below Kokoro or Chatterbox, but speed on CPU is unbeatable.
Download voices from the Piper voices repository.
Chatterbox (Best Voice Cloning)
ResembleAI’s Chatterbox can clone any voice from a 10-second audio clip and generates speech that beat ElevenLabs in blind listener tests. It’s open-source and runs locally:
```
pip install chatterbox-tts
```

```python
import torchaudio
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
text = "Here's your local AI assistant, speaking in a cloned voice."
wav = model.generate(text, audio_prompt_path="reference_voice.wav")
torchaudio.save("output.wav", wav, model.sr)
```
Needs about 2-4 GB VRAM. The voice cloning is genuinely impressive for a local model.
Skip These (For Now)
Bark: Sounds amazing. It can express emotions, laugh, even sing. But it’s painfully slow (2-5 seconds per sentence) and needs ~12 GB VRAM. Fun to demo, impractical for conversation.
edge-tts: Great quality and free, but it sends audio through Microsoft’s servers. Not local. Fine if you don’t care about privacy, but defeats the purpose for most readers of this site.
The Easy Way: Open WebUI Voice Chat
If you just want to talk to your local LLM without building anything, Open WebUI has built-in voice chat. It works with any Ollama model.
Setup
- Install Ollama and pull a chat model:
```
ollama pull qwen3:8b
```
- Install Open WebUI:
```
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```
- Open http://localhost:3000, create an account, and select your model. Click the microphone icon in the chat input to use voice. Open WebUI uses your browser’s built-in speech recognition (Web Speech API) for STT and can use various TTS backends.
Configuring Better TTS
Open WebUI’s default TTS is basic. For better quality, go to Settings → Audio and configure:
- STT Engine: Set to “whisper (local)” if you want fully local transcription
- TTS Engine: Point to a local TTS server (Kokoro or Piper running as a service)
This gives you a ChatGPT-like voice interface running entirely on your machine.
The DIY Way: Command-Line Voice Pipeline
For maximum control, you can wire up the pipeline yourself. Here’s a minimal working example using faster-whisper + Ollama + Kokoro:
```python
import requests
import sounddevice as sd
from faster_whisper import WhisperModel
from kokoro import KPipeline

# Initialize models
whisper = WhisperModel("turbo", device="cuda", compute_type="float16")
tts = KPipeline(lang_code="a")

def record_audio(duration=5, sample_rate=16000):
    """Record from microphone."""
    print("Listening...")
    audio = sd.rec(int(duration * sample_rate),
                   samplerate=sample_rate, channels=1, dtype="float32")
    sd.wait()
    return audio.flatten()

def transcribe(audio):
    """Speech to text with faster-whisper."""
    segments, _ = whisper.transcribe(audio, beam_size=5)
    return " ".join(s.text for s in segments).strip()

def query_llm(prompt):
    """Send text to Ollama and get response."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "qwen3:8b", "prompt": prompt, "stream": False})
    return response.json()["response"]

def speak(text):
    """Text to speech with Kokoro."""
    for _, _, audio in tts(text, voice="af_heart"):
        sd.play(audio, samplerate=24000)
        sd.wait()

# Main loop
while True:
    audio = record_audio()
    text = transcribe(audio)
    if not text:
        continue
    print(f"You: {text}")
    response = query_llm(text)
    print(f"AI: {response}")
    speak(response)
```
This is intentionally simple. For a real setup, you’d add:
- Voice activity detection (VAD) instead of fixed-duration recording
- Streaming LLM output to TTS (speak while generating)
- Conversation history (multi-turn context)
- A wake word or push-to-talk
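The “speak while generating” improvement is mostly a sentence-buffering problem: accumulate streamed fragments, hand each complete sentence to TTS as soon as it ends. A minimal sketch (the splitting regex is simplistic, so abbreviations like “Dr.” will fool it):

```python
import re

_SENTENCE_END = re.compile(r"([.!?])\s")

def sentences_from_stream(token_stream):
    """Buffer streamed text fragments and yield complete sentences as soon
    as they end, so TTS can start before the LLM finishes generating."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            match = _SENTENCE_END.search(buffer)
            if not match:
                break
            end = match.end(1)
            yield buffer[:end].strip()
            buffer = buffer[end:]
    if buffer.strip():  # flush whatever is left when the stream ends
        yield buffer.strip()

# Fragments as a streaming LLM might emit them:
stream = ["Hel", "lo there. ", "How can I ", "help you today?"]
print(list(sentences_from_stream(stream)))
# ['Hello there.', 'How can I help you today?']
```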
Dedicated Voice Chat Apps
If you want something more polished than a script but more customizable than Open WebUI:
Moshi (Full-Duplex)
Moshi from Kyutai is the first open-source full-duplex voice model: it can listen and talk simultaneously, like a phone call. No STT→LLM→TTS chain; it’s a single speech-to-speech model.
- 7B backbone with Mimi audio codec
- 160-200 ms latency
- Runs on a single GPU (needs ~16 GB VRAM)
- Genuinely feels like talking to someone
The catch: it’s a fixed model. You can’t swap in your favorite LLM. The voice quality and intelligence are whatever Moshi provides. But for natural conversation flow, nothing else comes close locally.
```
pip install moshi
python -m moshi.run
```
RealtimeVoiceChat
An open-source project that wires up the full STT→LLM→TTS pipeline with a web interface and push-to-talk. Supports Ollama, faster-whisper, and multiple TTS engines.
Pipecat
A framework for building voice assistants. More complex to set up, but supports interruption handling, different conversation flows, and multiple backend options. Good if you’re building something production-grade.
Hardware Requirements
The big question: can your GPU handle the LLM and the voice pipeline simultaneously?
VRAM Budget
| Component | VRAM Needed |
|---|---|
| Whisper turbo (faster-whisper, FP16) | ~6 GB |
| Whisper turbo (faster-whisper, INT8) | ~3 GB |
| Whisper small (faster-whisper, INT8) | ~1 GB |
| Kokoro TTS | ~0.5 GB (or CPU) |
| Piper TTS | 0 GB (CPU only) |
| Chatterbox TTS | 2-4 GB |
| Your LLM | Depends on model |
The math for a 12 GB GPU (RTX 3060):
- Whisper small INT8: 1 GB
- Piper TTS: 0 GB (runs on CPU)
- Remaining for LLM: ~10 GB → Qwen3-8B Q4_K_M fits comfortably
The math for a 24 GB GPU (RTX 3090):
- Whisper turbo INT8: 3 GB
- Kokoro TTS: 0.5 GB (or run on CPU)
- Remaining for LLM: ~19 GB → Qwen3-14B Q6_K or Qwen3-32B Q3_K_M
CPU-only setup (no GPU):
- whisper.cpp with base model: runs fine on any modern CPU
- Piper TTS: CPU-native, sub-100ms latency
- LLM: Qwen3-4B Q4_K_M via llama.cpp (~4 GB RAM)
- Total: workable but slower. Expect 2-4 second round-trips.
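The budgeting above is simple enough to wrap in a helper. The component figures come from the VRAM table; the 1 GB overhead allowance for CUDA context and runtime buffers is my own rough assumption, not a measured number:

```python
# Back-of-envelope VRAM budgeting for the voice pipeline. Component figures
# mirror the table above; the 1 GB overhead for CUDA context and runtime
# buffers is a rough assumption.
COMPONENT_VRAM_GB = {
    "whisper_turbo_fp16": 6.0,
    "whisper_turbo_int8": 3.0,
    "whisper_small_int8": 1.0,
    "kokoro_tts": 0.5,
    "piper_tts": 0.0,  # runs on CPU
}

def remaining_for_llm(total_gb, components, overhead_gb=1.0):
    """VRAM left for the LLM after reserving voice components and overhead."""
    used = sum(COMPONENT_VRAM_GB[c] for c in components)
    return round(total_gb - used - overhead_gb, 1)

# 12 GB card, Whisper small INT8 on GPU, Piper on CPU:
print(remaining_for_llm(12.0, ["whisper_small_int8", "piper_tts"]))  # 10.0
# 24 GB card, Whisper turbo INT8 + Kokoro on GPU:
print(remaining_for_llm(24.0, ["whisper_turbo_int8", "kokoro_tts"]))  # 19.5
```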
The Sharing Problem
Whisper and your LLM can’t use the GPU simultaneously by default β one waits for the other. This is fine because the pipeline is sequential anyway. Whisper finishes transcribing before the LLM starts generating.
TTS is where it gets tricky. If your TTS runs on GPU, it competes with the LLM for VRAM. The simplest fix: run TTS on CPU (Piper or Kokoro both handle this well) and keep the GPU for Whisper + LLM.
→ Use our Planning Tool to check exact VRAM for your setup.
Latency Optimization Tips
Want to get below 1 second total? Here’s what actually helps:
Use Whisper turbo, not large-v3. The accuracy difference is minimal, the speed difference is 6-8x.
Use INT8 quantization for Whisper. Cuts VRAM roughly in half with negligible accuracy loss:
```python
model = WhisperModel("turbo", device="cuda", compute_type="int8")
```
Enable VAD (Voice Activity Detection). Faster-whisper’s built-in Silero VAD trims silence before transcription, so Whisper only processes actual speech:
```python
segments, _ = model.transcribe(audio, vad_filter=True,
                               vad_parameters=dict(min_silence_duration_ms=500))
```
Stream the LLM output to TTS. Don’t wait for the full response. Start speaking the first sentence while the LLM generates the rest. This is the single biggest latency improvement: it turns a 3-second wait into perceived sub-second response.
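Ollama’s /api/generate endpoint streams newline-delimited JSON when you request streaming: each line carries a "response" text fragment, and "done": true marks the end. A minimal parser for those lines (pair it with `requests.post(..., stream=True)` and `iter_lines()`; the function name is my own):

```python
import json

def tokens_from_ndjson(lines):
    """Yield text fragments from Ollama's streaming NDJSON lines.
    Each line is a JSON object with a "response" fragment; a "done": true
    object marks the end of generation."""
    for line in lines:
        if not line.strip():
            continue
        obj = json.loads(line)
        if obj.get("response"):
            yield obj["response"]
        if obj.get("done"):
            break

# Lines in the shape Ollama emits them:
sample = [
    '{"response": "Hello", "done": false}',
    '{"response": " there.", "done": false}',
    '{"done": true}',
]
print("".join(tokens_from_ndjson(sample)))  # Hello there.
```

Feed the yielded fragments into a sentence buffer and you can start speaking the first sentence while the rest is still generating.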
Run TTS on CPU, STT on GPU. Piper and Kokoro are fast enough on CPU that you don’t need GPU acceleration. This leaves all VRAM for Whisper and the LLM.
Use a smaller LLM. A Qwen3-4B model generates its first token in ~100 ms on a decent GPU. A 32B model takes 400+ ms. For voice chat, speed matters more than model intelligence β pick the smallest model that gives acceptable answers.
Recommended Setups
Budget (8 GB VRAM or CPU-only)
- STT: whisper.cpp with base model (CPU) or faster-whisper small INT8 (GPU)
- LLM: Qwen3-4B Q4_K_M via Ollama
- TTS: Piper (CPU)
- Interface: Open WebUI or custom script
- Expected latency: 1.5-3 seconds
Mid-Range (12 GB VRAM)
- STT: faster-whisper turbo INT8 (GPU, ~3 GB)
- LLM: Qwen3-8B Q4_K_M via Ollama (~5 GB)
- TTS: Kokoro (CPU) or Piper (CPU)
- Interface: Open WebUI with local Whisper
- Expected latency: 0.8-1.5 seconds
High-End (24 GB VRAM)
- STT: faster-whisper turbo FP16 (GPU, ~6 GB)
- LLM: Qwen3-14B Q6_K via Ollama (~13 GB)
- TTS: Kokoro (CPU) or Chatterbox (GPU if VRAM allows)
- Interface: Custom pipeline with streaming
- Expected latency: 0.5-1.0 seconds
Common Problems
“Whisper keeps transcribing silence.” Enable VAD filtering. Without it, Whisper tries to transcribe background noise and outputs garbage. Use `vad_filter=True` in faster-whisper.
“The TTS voice sounds robotic.” Switch from Piper to Kokoro. Or try a higher-quality Piper voice: the “medium” and “high” quality voices sound significantly better than “low.”
“There’s a long pause before the response starts.” You’re probably waiting for the full LLM response before sending it to TTS. Stream the output and start speaking the first sentence immediately.
“My GPU runs out of memory.” Run TTS on CPU (Piper or Kokoro). Use Whisper small or base instead of turbo. Use a smaller LLM quantization (Q3_K_M instead of Q4_K_M).
“CUDA out of memory when switching between Whisper and LLM.” Some frameworks don’t release VRAM properly. Use `del model; torch.cuda.empty_cache()` between stages, or keep both models loaded if VRAM allows.
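That teardown pattern can be wrapped in a small helper. This is a sketch, not any framework’s API: the function name and dict-based model ownership are my own, and it degrades gracefully when PyTorch or CUDA is absent:

```python
import gc
import importlib.util

def release_gpu_model(models: dict, key: str) -> bool:
    """Drop the last reference to a loaded model, then ask PyTorch to return
    cached VRAM to the driver. Returns True if a CUDA cache flush ran."""
    models.pop(key, None)  # delete the Python reference (the `del model` step)
    gc.collect()           # collect anything kept alive by reference cycles
    if importlib.util.find_spec("torch") is not None:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release cached blocks to the driver
            return True
    return False
```

Call it between pipeline stages, e.g. `release_gpu_model(loaded, "whisper")` before loading the LLM.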
Bottom Line
Local voice chat works today and it’s easier to set up than you’d think. The fastest path:
- Install Ollama + Open WebUI for instant voice chat
- For better quality, add faster-whisper (turbo) and Kokoro
- For the most natural experience, try Moshi (if you have 16 GB VRAM)
With a 12 GB GPU, you can run the full pipeline (Whisper turbo INT8, an 8B chat model, and Piper TTS) with about 1 second of latency. That’s fast enough for actual conversation.
The technology gap between “local voice assistant” and “cloud voice assistant” is closing fast. A year ago this would have required multiple high-end GPUs. Now it runs on an RTX 3060.