microsoft/VibeVoice
Summary
Microsoft's VibeVoice is an MIT-licensed, Whisper-style speech-to-text model with integrated speaker diarization: it transcribes audio files while identifying and labeling the different speakers in a single pass.
Key Points
- Features built-in speaker diarization within the model architecture.
- Available as a 4-bit quantized MLX conversion (`mlx-community/VibeVoice-ASR-4bit`) for efficient execution on Apple Silicon.
- Supports both `.wav` and `.mp3` input formats.
- Outputs transcription data as a JSON array of objects containing `text`, `start`, `end`, `duration`, and `speaker_id`.
- Imposes a maximum processing limit of one hour of audio per session.
- The default `--max-tokens` parameter is 8192, which is sufficient for approximately 25 minutes of audio; larger files require manually increasing this value.
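The token-budget heuristic above can be turned into a rough sizing helper. This is an illustrative sketch only: the linear scaling from 8192 tokens per ~25 minutes is an extrapolation, not a documented formula, and the function name and headroom factor are assumptions.

```python
def estimate_max_tokens(audio_minutes: float,
                        tokens_per_25_min: int = 8192,
                        headroom: float = 1.2) -> int:
    """Estimate a --max-tokens value for a given audio length.

    Assumes token usage scales roughly linearly with duration,
    extrapolated from the default 8192 tokens covering ~25 minutes
    (an assumption). The headroom factor pads the estimate to avoid
    truncated output.
    """
    tokens = audio_minutes / 25.0 * tokens_per_25_min * headroom
    # Round up to the next multiple of 1024.
    return int(-(-tokens // 1024) * 1024)

# One hour of audio needs roughly 3x the default budget:
print(estimate_max_tokens(60))  # → 24576
```

With the 20% headroom, a one-hour file lands at 24576 tokens, comfortably above the ~19700 a purely linear extrapolation would suggest.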
Technical Details
The model can be run on macOS using uv and mlx-audio. Benchmarks on an M5 Max MacBook Pro (128 GB) show one hour of audio being processed in approximately 524.79 seconds. During execution, the prefill stage peaks at roughly 61.5 GB of memory, while generation uses around 18 GB. Token throughput is approximately 50.718 tokens/sec for prompt processing and 38.585 tokens/sec for generation.
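As a quick sanity check on the benchmark figures, the reported wall-clock time implies a real-time factor of just under 7x:

```python
audio_seconds = 3600.0   # one hour of input audio
wall_seconds = 524.79    # reported processing time on the M5 Max

# Real-time factor: how many seconds of audio are processed
# per second of wall-clock time.
rtf = audio_seconds / wall_seconds
print(f"real-time factor: {rtf:.2f}x")  # → real-time factor: 6.86x
```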
The output is structured as a JSON array in which each segment carries precise timestamps and a `speaker_id`. Because the model is limited to one hour of audio per run, longer files must be split manually. To maintain transcription integrity, segments should overlap by approximately one minute to prevent errors at the split points, and `speaker_id` values must be manually re-aligned across segments.
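The segmentation described above can be sketched as follows. The window math (one-hour limit, ~one-minute overlap) and the segment fields (`text`, `start`, `end`, `duration`, `speaker_id`) come from the text; the function names and the timestamp-shifting approach are illustrative assumptions, and speaker re-alignment across chunks is left as the manual step the source describes.

```python
def chunk_bounds(total_seconds: float,
                 chunk: float = 3600.0,
                 overlap: float = 60.0):
    """Yield (start, end) windows covering the audio, each at most
    `chunk` seconds long and overlapping the previous by `overlap`."""
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk, total_seconds)
        yield (start, end)
        if end >= total_seconds:
            break
        start = end - overlap

def shift_segments(segments, offset):
    """Convert a chunk's local timestamps back to absolute file time.
    speaker_id values still need manual re-alignment across chunks."""
    return [{**seg,
             "start": seg["start"] + offset,
             "end": seg["end"] + offset}
            for seg in segments]

# A 2.5-hour file becomes three overlapping one-hour windows:
print(list(chunk_bounds(9000)))
```

Each chunk would be transcribed independently, its segments shifted by the chunk's start offset, and the overlapping minute used to stitch the transcripts and match speaker labels between runs.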
Impact / Why It Matters
VibeVoice provides a high-performance, self-hostable solution for developers needing integrated transcription and diarization. Its compatibility with MLX-based workflows makes it particularly efficient for deployment on Apple Silicon hardware.