New Release: Microsoft VibeVoice-Realtime-0.5B

Real-Time Voice,
Ultra-Low Latency.

Experience the next generation of streaming speech synthesis: first-audio latency under 300 ms, emotional expressiveness, and multi-speaker conversation in a lightweight 0.5B-parameter model.

0.5B Parameters
<300ms First Token Latency
24kHz Audio Quality

See VibeVoice in Action

Watch how VibeVoice generates speech in real-time with ultra-low latency.


Why VibeVoice?

Ultra-Low Latency

Streaming speech synthesis with under 300 ms to first audio. It can start speaking before you finish typing.

0.5B Parameters

Ultra-lightweight model optimized for edge devices and real-time applications.

Streaming Input

Incrementally encodes incoming text chunks while generating audio in parallel.
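The Streaming Input idea can be pictured as a producer/consumer pair: one side keeps feeding text chunks while a worker turns earlier chunks into audio. The sketch below is only an illustration of that pattern. `fake_synthesize` and `stream_speech` are hypothetical names invented here, not part of the VibeVoice API, and the "audio" is a placeholder string.

```python
import queue
import threading


def fake_synthesize(text_chunk):
    # Hypothetical stand-in for a real TTS call; returns a placeholder label.
    return f"<audio:{text_chunk}>"


def stream_speech(text_chunks):
    """Feed text chunks to a worker thread that synthesizes them in arrival order."""
    pending = queue.Queue()
    audio_out = []

    def worker():
        while True:
            chunk = pending.get()
            if chunk is None:  # sentinel: no more text will arrive
                break
            audio_out.append(fake_synthesize(chunk))

    thread = threading.Thread(target=worker)
    thread.start()
    for chunk in text_chunks:  # producer: text keeps arriving
        pending.put(chunk)     # worker may already be synthesizing earlier chunks
    pending.put(None)
    thread.join()
    return audio_out
```

Because a single worker drains a FIFO queue, audio segments come out in the same order the text went in.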

Robust Long-form

Generates consistent, high-quality speech for long durations (~10 min) without degradation.

Emotional Intelligence

Automatically detects and expresses anger, excitement, sadness, and more.

Multi-Role Conversation

Supports up to 4 distinct speakers in a single session with consistent prosody.

Benchmark Performance

VibeVoice-Realtime-0.5B delivers first-audio latency under 300 ms while maintaining high audio fidelity.

Latency

<300ms

First Audible Token
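If you want to check a first-audio latency figure on your own hardware, one simple approach is to time how long a streaming generator takes to yield its first chunk. This is a minimal sketch under assumptions: `time_to_first_chunk` is a hypothetical helper written for this page, and it works with any Python iterator of audio chunks rather than a specific VibeVoice interface.

```python
import time


def time_to_first_chunk(chunk_stream, clock=time.perf_counter):
    """Return (first_chunk, elapsed_ms) for an iterator of audio chunks.

    `clock` is injectable so the measurement can be tested deterministically.
    """
    start = clock()
    first = next(iter(chunk_stream), None)  # blocks until the first chunk exists
    elapsed_ms = (clock() - start) * 1000.0
    return first, elapsed_ms


def simulated_stream():
    # Assumption for illustration only: pretend the model needs 50 ms of work.
    time.sleep(0.05)
    yield b"\x00\x00"
    yield b"\x00\x00"


first_chunk, latency_ms = time_to_first_chunk(simulated_stream())
print(f"first chunk after {latency_ms:.1f} ms")
```

Measured values depend on the GPU, the input text, and how the stream is consumed, so treat a single run as a rough indication.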

Model Size

0.5B

Parameters (Lightweight)

Context

8k

Token Context Window

LibriSpeech Test-Clean

Model            WER (%) ↓   SIM ↑
VALL-E 2         2.40        0.643
Voicebox         1.90        0.662
MELLE            2.10        0.625
VibeVoice-0.5B   2.00        0.695

SEED Test-En

Model            WER (%) ↓   SIM ↑
MaskGCT          2.62        0.714
Seed-TTS         2.25        0.762
FireRedTTS       3.82        0.460
VibeVoice-0.5B   2.05        0.633
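For readers unfamiliar with the WER column: word error rate is the word-level edit distance between a reference transcript and a hypothesis transcript, divided by the number of reference words. The sketch below shows that standard definition only. The exact evaluation pipeline behind the numbers above (text normalization, which ASR model transcribes the generated audio) is not stated on this page, and `word_error_rate` is a name chosen here for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate in percent, via word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Lower is better, which is what the ↓ in the column header indicates.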

Get Started

1. Launch Docker

sudo docker run --privileged --net=host --ipc=host --ulimit memlock=-1:-1 --ulimit stack=-1:-1 --gpus all --rm -it nvcr.io/nvidia/pytorch:24.07-py3

2. Install from GitHub

git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice/
pip install -e .

3. Run Inference

# Run real-time demo
python demo/vibevoice_realtime_demo.py --model_path microsoft/VibeVoice-Realtime-0.5B

Architecture

Figure: VibeVoice Realtime architecture.

Interleaved Windowed Design

Processes text chunks immediately as they arrive, enabling "speak-while-thinking" capability.

Curriculum Learning

Progressively expands context window up to 8192 tokens for long-form coherence.
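One way to picture a progressive context schedule is a list of training stages whose maximum sequence length grows until it reaches the 8192-token target. The sketch below is purely illustrative: the starting length and the doubling factor are assumptions made here, and the page only states the final 8192-token window.

```python
def context_schedule(start_len, max_len=8192, factor=2):
    """Return increasing context lengths, ending exactly at max_len.

    start_len and factor are hypothetical; only max_len=8192 comes from the page.
    """
    if start_len <= 0 or factor <= 1:
        raise ValueError("start_len must be positive and factor greater than 1")
    lengths = []
    current = start_len
    while current < max_len:
        lengths.append(current)
        current *= factor
    lengths.append(max_len)  # always finish at the target window
    return lengths


print(context_schedule(1024))
```

Starting with short sequences keeps early training cheap, and the later stages expose the model to the long contexts needed for long-form coherence.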

Diffusion Decoding

Uses DDPM with Classifier-Free Guidance for high-fidelity audio generation.
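Classifier-Free Guidance is commonly written as a blend of two noise predictions at each denoising step: one conditioned on the input and one unconditioned. The sketch below shows that generic formula on plain Python lists. It is not taken from the VibeVoice code, and `apply_cfg` and `guidance_scale` are names chosen here for illustration.

```python
def apply_cfg(cond_pred, uncond_pred, guidance_scale):
    """Generic classifier-free guidance: uncond + scale * (cond - uncond)."""
    if len(cond_pred) != len(uncond_pred):
        raise ValueError("predictions must have the same length")
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(cond_pred, uncond_pred)
    ]


# scale 1.0 reproduces the conditional prediction; larger values push
# the result further away from the unconditional one.
print(apply_cfg([1.0, 2.0], [0.0, 1.0], 2.0))
```

A scale of 0 ignores the conditioning entirely, and values above 1 trade diversity for closer adherence to the conditioning signal.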

Built for Real-Time

🤖

AI Assistants

Give your chatbot a human voice that interrupts naturally and expresses emotion.

🎙️

Live Streaming

Generate real-time commentary or dubbing for live video content with minimal delay.

🎮

Gaming NPCs

Create immersive characters that respond instantly to player actions with appropriate tone.

👁️

Accessibility

Instant screen reading for visually impaired users that feels like a real conversation.
