Real-Time Voice,
Zero Latency.
Experience the next generation of streaming speech synthesis. <300ms latency, emotional intelligence, and multi-role capabilities in a lightweight 0.5B model.
See VibeVoice in Action
Watch how VibeVoice generates speech in real-time with ultra-low latency.
Why VibeVoice?
Zero Latency
Streaming speech synthesis with under 300 ms to first audible audio, so playback can begin before you finish typing.
0.5B Parameters
Ultra-lightweight model optimized for edge devices and real-time applications.
Streaming Input
Incrementally encodes incoming text chunks while generating audio in parallel.
Robust Long-form
Generates consistent, high-quality speech for long durations (~10 min) without degradation.
Emotional Intelligence
Automatically detects and expresses anger, excitement, sadness, and more.
Multi-Role Conversation
Supports up to 4 distinct speakers in a single session with consistent prosody.
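To make the streaming-input idea concrete, here is a minimal sketch of consuming text chunks while audio is produced in parallel. It is not the VibeVoice API: `fake_synthesize` and `stream_tts` are hypothetical names, and the "audio" is a placeholder list rather than real model output.

```python
import queue
import threading


def fake_synthesize(text_chunk):
    """Hypothetical stand-in for a TTS call: one placeholder sample per character."""
    return [0.0] * len(text_chunk)


def stream_tts(text_chunks, synthesize=fake_synthesize):
    """Yield (chunk, audio) pairs as soon as each text chunk arrives.

    A producer thread feeds chunks into a queue (as a typing user or an LLM
    would), while this generator consumes them and emits audio in parallel.
    """
    pending = queue.Queue()
    done = object()  # sentinel marking the end of the text stream

    def producer():
        for chunk in text_chunks:
            pending.put(chunk)
        pending.put(done)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        chunk = pending.get()
        if chunk is done:
            break
        yield chunk, synthesize(chunk)


results = list(stream_tts(["Hello, ", "this is ", "streamed text."]))
for chunk, audio in results:
    print(repr(chunk), len(audio))
```

The queue decouples text arrival from audio generation, so the consumer can start on the first chunk without waiting for the full sentence.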
Benchmark Performance
VibeVoice-Realtime-0.5B achieves state-of-the-art latency while maintaining high audio fidelity.
Latency: <300ms to first audible token
Model size: 0.5B parameters (lightweight)
Context: 8192-token context window
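If you want to check first-audio latency in your own setup, one rough approach is to time how long a streaming generator takes to yield its first chunk. The sketch below assumes the audio arrives as a Python iterator; `dummy_stream` is a hypothetical stand-in that you would replace with your real audio stream, and the numbers it produces say nothing about VibeVoice itself.

```python
import time


def time_to_first_chunk(audio_stream):
    """Return the first chunk and the elapsed time in milliseconds until it arrived."""
    start = time.perf_counter()
    first = next(audio_stream)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return first, elapsed_ms


def dummy_stream(delay_s=0.01, n_chunks=3):
    """Hypothetical stream: sleeps to imitate model work, then yields silent bytes."""
    for _ in range(n_chunks):
        time.sleep(delay_s)
        yield b"\x00" * 320


first_chunk, latency_ms = time_to_first_chunk(dummy_stream())
print(f"first chunk: {len(first_chunk)} bytes after {latency_ms:.1f} ms")
```

Measure from the moment the first text is submitted, and average over several runs, since a single measurement can be skewed by model warm-up.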
LibriSpeech Test-Clean
| Model | WER (%) ↓ | SIM ↑ |
|---|---|---|
| VALL-E 2 | 2.40 | 0.643 |
| Voicebox | 1.90 | 0.662 |
| MELLE | 2.10 | 0.625 |
| VibeVoice-0.5B | 2.00 | 0.695 |
SEED Test-En
| Model | WER (%) ↓ | SIM ↑ |
|---|---|---|
| MaskGCT | 2.62 | 0.714 |
| Seed-TTS | 2.25 | 0.762 |
| FireRedTTS | 3.82 | 0.460 |
| VibeVoice-0.5B | 2.05 | 0.633 |
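For readers unfamiliar with the two columns, the sketch below shows how such metrics are commonly computed. WER is the word-level edit distance divided by the number of reference words. SIM is assumed here to be cosine similarity between speaker embeddings, which is the usual convention, though this page does not state which ASR or speaker model was used. The function names are illustrative only.

```python
import math


def word_error_rate(reference, hypothesis):
    """WER in percent: word-level Levenshtein distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
sim = cosine_similarity([1.0, 0.0], [1.0, 0.0])
print(f"WER = {wer:.2f}%, SIM = {sim:.3f}")
```

Reported WER also depends on the ASR system that transcribes the generated audio and on text normalization, so figures from different setups are not directly comparable.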
Get Started
1. Launch Docker
sudo docker run --privileged --net=host --ipc=host --ulimit memlock=-1:-1 --ulimit stack=-1:-1 --gpus all --rm -it nvcr.io/nvidia/pytorch:24.07-py3
2. Install from GitHub
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice/
pip install -e .
3. Run Inference
# Run real-time demo
python demo/vibevoice_realtime_demo.py --model_path microsoft/VibeVoice-Realtime-0.5B
Architecture
Interleaved Windowed Design
Processes text chunks immediately as they arrive, enabling "speak-while-thinking" capability.
Curriculum Learning
Progressively expands context window up to 8192 tokens for long-form coherence.
Diffusion Decoding
Uses DDPM with Classifier-Free Guidance for high-fidelity audio generation.
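As a rough illustration of Classifier-Free Guidance, the sketch below combines an unconditional and a conditional noise prediction at one denoising step. It uses the common formulation `uncond + scale * (cond - uncond)`; the exact convention and guidance scale VibeVoice uses are not stated here, so treat `cfg_combine` and its inputs as assumptions.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Blend two noise predictions with classifier-free guidance.

    guidance_scale = 0 returns the unconditional prediction, 1 returns the
    conditional one, and values above 1 push further toward the condition.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions standing in for the model's two forward passes.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]
guided = cfg_combine(eps_uncond, eps_cond, guidance_scale=2.0)
print(guided)
```

In a real sampler the guided prediction replaces the plain conditional one inside each DDPM update step, at the cost of a second forward pass per step.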
Built for Real-Time
AI Assistants
Give your chatbot a human voice that interrupts naturally and expresses emotion.
Live Streaming
Generate real-time commentary or dubbing for live video content with minimal delay.
Gaming NPCs
Create immersive characters that respond instantly to player actions with appropriate tone.
Accessibility
Instant screen reading that feels like a real conversation for visually impaired users.
Community Buzz
See what developers and researchers are saying about VibeVoice.