New Release: Microsoft VibeVoice-Realtime-0.5B

Real-Time Voice,
Zero Latency.

Experience the next generation of streaming speech synthesis. <300ms latency, emotional intelligence, and multi-role capabilities in a lightweight 0.5B model.

0.5B Parameters

<300ms First Token Latency

24kHz Audio Quality

See VibeVoice in Action

Watch how VibeVoice generates speech in real-time with ultra-low latency.

Why VibeVoice?

Zero Latency

Streaming speech synthesis with <300ms first-audio latency. Speech can start before you finish typing.
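
A latency figure like this can be checked by timing how long a streaming generator takes to yield its first audio chunk. The sketch below is only an illustration: `fake_tts_stream` is a hypothetical stand-in that sleeps and emits silence, not a VibeVoice API, and the 16-bit sample format is an assumption.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_tts_stream(text: str, delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call: yields audio chunks."""
    for _word in text.split():
        time.sleep(delay_s)          # pretend to synthesize one chunk
        yield b"\x00\x00" * 480      # 480 silent 16-bit samples (20 ms at 24 kHz)


def time_to_first_chunk(stream: Iterable[bytes]) -> Tuple[float, bytes]:
    """Return (seconds until the first chunk arrived, the first chunk)."""
    start = time.perf_counter()
    first = next(iter(stream))
    return time.perf_counter() - start, first


latency_s, chunk = time_to_first_chunk(fake_tts_stream("hello streaming world"))
print(f"first chunk after {latency_s * 1000:.0f} ms, {len(chunk)} bytes")
```

Swapping the stand-in for a real streaming call would give a first-chunk measurement on your own hardware; playback buffering would add to the audible delay.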

0.5B Parameters

Ultra-lightweight model optimized for edge devices and real-time applications.

Streaming Input

Incrementally encodes incoming text chunks while generating audio in parallel.
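
The general pattern of encoding text while audio is generated in parallel can be sketched as a producer and a consumer joined by a queue. Everything here is an assumption for illustration: `encode_chunk` and `synthesize` are hypothetical placeholders, and the model's actual mechanism is not described on this page.

```python
import queue
import threading
from typing import List

SENTINEL = None  # marks the end of the text stream


def encode_chunk(text: str) -> List[int]:
    """Hypothetical text encoder: one integer per character."""
    return [ord(c) for c in text]


def synthesize(tokens: List[int]) -> bytes:
    """Hypothetical audio generator: one placeholder byte per token."""
    return bytes(t % 256 for t in tokens)


def text_producer(chunks: List[str], q: "queue.Queue") -> None:
    """Encode text chunks as they arrive and hand them to the audio side."""
    for chunk in chunks:
        q.put(encode_chunk(chunk))
    q.put(SENTINEL)


def audio_consumer(q: "queue.Queue", out: List[bytes]) -> None:
    """Generate audio for each encoded chunk as soon as it is available."""
    while True:
        tokens = q.get()
        if tokens is SENTINEL:
            break
        out.append(synthesize(tokens))


chunks = ["Hello, ", "this is ", "streamed text."]
q: "queue.Queue" = queue.Queue()
audio: List[bytes] = []
producer = threading.Thread(target=text_producer, args=(chunks, q))
consumer = threading.Thread(target=audio_consumer, args=(q, audio))
consumer.start()
producer.start()
producer.join()
consumer.join()
print(len(audio), "audio segments,", sum(len(a) for a in audio), "bytes")
```

The consumer starts working on the first chunk while later chunks are still being encoded, which is the property the feature describes.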

Robust Long-form

Generates consistent, high-quality speech for long durations (~10 min) without degradation.
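
For a sense of scale, the arithmetic below works out how much audio a 10-minute generation represents at 24 kHz. The 16-bit mono PCM format is an assumption; the page states only the sample rate and the approximate duration.

```python
SAMPLE_RATE_HZ = 24_000   # from the page: 24 kHz audio
DURATION_S = 10 * 60      # from the page: roughly 10 minutes
BYTES_PER_SAMPLE = 2      # assumption: 16-bit mono PCM

num_samples = SAMPLE_RATE_HZ * DURATION_S
pcm_bytes = num_samples * BYTES_PER_SAMPLE
print(num_samples)                 # 14400000
print(round(pcm_bytes / 1e6, 1))   # 28.8 (MB, decimal)
```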

Emotional Intelligence

Automatically detects and expresses anger, excitement, sadness, and more.

Multi-Role Conversation

Supports up to 4 distinct speakers in a single session with consistent prosody.

Benchmark Performance

VibeVoice-Realtime-0.5B achieves state-of-the-art latency while maintaining high audio fidelity.

Latency

<300ms

First Audible Token

Model Size

0.5B

Parameters (Lightweight)

Context

8192

Token Context Window

LibriSpeech Test-Clean

Model	WER (%) ↓	SIM ↑
VALL-E 2	2.40	0.643
Voicebox	1.90	0.662
MELLE	2.10	0.625
VibeVoice-0.5B	2.00	0.695
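
WER (word error rate) is conventionally the word-level edit distance between a reference transcript and the recognized transcript of the generated audio, divided by the number of reference words. The sketch below shows that standard definition; the lowercasing and whitespace tokenization are simplifying assumptions, and the page does not say which ASR system or text normalization produced the numbers above.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the quick brown fox jumps", "the quick brown box jumps")
print(f"{100 * wer:.2f}%")  # 20.00%
```

One substituted word out of five reference words gives 20%; a WER of 2.00% corresponds to roughly one word error per fifty reference words.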

SEED Test-En

Model	WER (%) ↓	SIM ↑
MaskGCT	2.62	0.714
Seed-TTS	2.25	0.762
FireRedTTS	3.82	0.460
VibeVoice-0.5B	2.05	0.633
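
SIM scores of this kind are commonly computed as the cosine similarity between a speaker embedding of the reference voice and one of the generated audio. That reading is an assumption here, since the page does not define SIM or name the embedding model; the toy vectors below are made up.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "speaker embeddings" (made-up values for illustration).
reference_voice = [1.0, 0.0, 0.0]
generated_voice = [1.0, 1.0, 0.0]
print(round(cosine_similarity(reference_voice, generated_voice), 3))  # 0.707
```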

Get Started

1. Launch Docker

sudo docker run --privileged --net=host --ipc=host --ulimit memlock=-1:-1 --ulimit stack=-1:-1 --gpus all --rm -it nvcr.io/nvidia/pytorch:24.07-py3

2. Install from GitHub

git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice/
pip install -e .

3. Run Inference

# Run real-time demo
python demo/vibevoice_realtime_demo.py --model_path microsoft/VibeVoice-Realtime-0.5B

Architecture

Interleaved Windowed Design

Processes text chunks immediately as they arrive, enabling "speak-while-thinking" capability.
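
One way to picture an interleaved, windowed schedule is a loop that reads a small window of text tokens, emits some audio frames, and repeats. This is a rough sketch under stated assumptions: the window size, frames per window, and the `interleave` helper are hypothetical and do not reflect the model's actual values.

```python
from typing import Iterator, List, Tuple

Step = Tuple[str, List[str]]


def interleave(tokens: List[str], text_window: int, audio_steps: int) -> Iterator[Step]:
    """Alternate: consume one window of text tokens, then emit audio frames.

    Window sizes here are illustrative assumptions, not the model's values.
    """
    if text_window < 1 or audio_steps < 1:
        raise ValueError("text_window and audio_steps must be positive")
    frame = 0
    for start in range(0, len(tokens), text_window):
        yield ("text", tokens[start:start + text_window])
        # Audio for this window is produced before the next text window is read.
        frames = [f"frame{frame + k}" for k in range(audio_steps)]
        frame += audio_steps
        yield ("audio", frames)


schedule = list(interleave("speak while the text is arriving".split(), 2, 3))
for kind, payload in schedule:
    print(kind, payload)
```

Because audio steps follow each text window, the first frames are available after only two tokens have been read, not after the whole sentence.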

Curriculum Learning

Progressively expands context window up to 8192 tokens for long-form coherence.
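
A progressive context schedule could look like the sketch below. Only the 8192-token ceiling comes from this page; the doubling rule, the starting length, and the `context_schedule` helper are assumptions made for illustration.

```python
from typing import List


def context_schedule(start_len: int, max_len: int = 8192) -> List[int]:
    """Doubling schedule of context lengths, capped at max_len.

    Doubling and the starting length are assumptions for illustration;
    only the 8192-token ceiling comes from the page.
    """
    if start_len < 1 or max_len < start_len:
        raise ValueError("need 1 <= start_len <= max_len")
    lengths = []
    current = start_len
    while current < max_len:
        lengths.append(current)
        current *= 2
    lengths.append(max_len)
    return lengths


print(context_schedule(1024))  # [1024, 2048, 4096, 8192]
```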

Diffusion Decoding

Uses DDPM with Classifier-Free Guidance for high-fidelity audio generation.
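
Classifier-free guidance combines a conditional and an unconditional noise prediction at each denoising step. The sketch below uses the common form in which a scale of 1 recovers the conditional prediction; the numbers are toy values, and the guidance scale the model actually uses is not given on this page.

```python
from typing import List


def cfg_combine(eps_cond: List[float], eps_uncond: List[float], scale: float) -> List[float]:
    """Classifier-free guidance: eps_uncond + scale * (eps_cond - eps_uncond)."""
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy noise predictions (made-up numbers); scale is an illustrative value.
eps_c = [0.5, -1.0, 2.0]
eps_u = [0.0, -0.5, 1.0]
print(cfg_combine(eps_c, eps_u, scale=1.0))  # [0.5, -1.0, 2.0], the conditional prediction
print(cfg_combine(eps_c, eps_u, scale=2.0))  # [1.0, -1.5, 3.0]
```

Scales above 1 push the prediction further from the unconditional output, which typically trades diversity for closer adherence to the conditioning text.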

Built for Real-Time

🤖

AI Assistants

Give your chatbot a human voice that interrupts naturally and expresses emotion.

🎙️

Live Streaming

Generate real-time commentary or dubbing for live video content with minimal delay.

🎮

Gaming NPCs

Create immersive characters that respond instantly to player actions with appropriate tone.

👁️

Accessibility

Instant screen reading for visually impaired users that feels like a real conversation.

Community Buzz

See what developers and researchers are saying about VibeVoice.

cocktail peanut

@cocktailpeanut

1-Click Vibevoice-Realtime for ALL machines Finally a Realtime TTS that is ACTUALLY realtime, thanks to its tiny size! (0.5B params) This video is from my Windows machine, recorded in realtime. All you need is around 2.5GB VRAM, it even works on Macs!

@_akhaliq

Microsoft just released VibeVoice-Realtime-0.5B huggingface.co/microsoft/Vibe…

11:32 PM · Dec 4, 2025

Min Choi

@minchoi

Microsoft just dropped VibeVoice-Realtime-0.5B Open-source realtime TTS AI model that starts talking in ~300 ms Streaming, long-form and insanely fast.

8:10 PM · Dec 4, 2025

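Microsoft has released VibeVoice-Realtime-0.5B, a real-time text-to-speech model. It converts almost in real time and can start speaking before you have finished 😅 It supports Chinese and English, though Chinese is a little weaker. Main features: 🕒 Near-real-time speech (300 ms) 🗣️ Natural, fluent voice that can read long texts and generate up to 90 minutes of fluent speech 💻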

2:38 AM · Dec 5, 2025

steven

@Tu7uruu

VibeVoice from Microsoft goes real-time! > 0.5B model tuned for ultra-low-latency speech > Lightweight architecture, high-fidelity outputs > Token-level streaming for instant feedback > Designed for real-time LLM interaction

6:39 PM · Dec 4, 2025

