Real-time local TTS (31M params, 5.6x real time on CPU, voice cloning, ONNX)
Hi guys and gals, I made a TTS model built on my heavily upgraded VITS base and conditioned on external speaker embeddings (Resemble AI's Resemblyzer). At ~31M parameters, it is tuned for latency and local inference and comes already exported to ONNX. I was trying to push the limits of what I could do with small, fast models. It runs at 5.6x real time on a server CPU.

It supports voice cloning and voice blending (mix two or more speakers to make a new voice). The license is Apache 2.0, and phonemization uses DeepPhonemizer (MIT), so there are no license issues. The repo contains the checkpoint, instructions for running it, and links to Colab and HuggingFace demos.

Now, because it's tiny, audio quality isn't the best, and as it was trained on LibriTTS-R + VCTK (both fully open datasets), speaker similarity isn't as good as it could be. Regardless, I hope it's useful.
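For anyone wondering what "mix two or more speakers" could look like in practice, here is a minimal sketch of one common approach: a weighted average of speaker embeddings, renormalized to unit length. The post does not say how the model blends voices, so this is an assumption, not the repo's actual code. The names `blend_embeddings` and `l2_normalize` are hypothetical, and the sketch assumes the embeddings are plain float vectors of equal length (such as the unit-length vectors a speaker encoder like Resemblyzer is expected to produce).

```python
import math
from typing import List, Optional, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def blend_embeddings(
    embeddings: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Weighted average of speaker embeddings, renormalized to unit length.

    Hypothetical helper: the actual blending method is not described in the post.
    """
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same length")
    if weights is None:
        weights = [1.0] * len(embeddings)  # equal mix by default
    if len(weights) != len(embeddings):
        raise ValueError("need exactly one weight per embedding")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    mixed = [
        sum(w * e[i] for w, e in zip(weights, embeddings)) / total
        for i in range(dim)
    ]
    return l2_normalize(mixed)


# Toy 2-dimensional "embeddings" purely for illustration.
speaker_a = [1.0, 0.0]
speaker_b = [0.0, 1.0]
mostly_a = blend_embeddings([speaker_a, speaker_b], weights=[3.0, 1.0])
```

Renormalizing after averaging keeps the blended vector on the same scale as the inputs, which matters if the model was only ever conditioned on unit-length embeddings.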
✨ AI Summary
A lightweight, real-time text-to-speech model optimized for local CPU inference, offering voice cloning and blending capabilities. It prioritizes speed and low resource usage over the highest audio quality and speaker similarity.
Best For
- Developers needing fast, local TTS for applications
- Hobbyists experimenting with voice synthesis on limited hardware
- Projects requiring permissive Apache 2.0 licensing
Why It Matters
It provides a fast, locally-runnable TTS solution with voice manipulation features, balancing capability with a small model size suitable for CPU deployment.
Key Features
- Text-to-speech that runs at 5.6x real time on a server CPU
- Voice cloning using external speaker embeddings from Resemblyzer
- Voice blending to mix multiple speakers and create new voices
- Low-latency local inference with a pre-exported ONNX model
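To make the 5.6x figure concrete: a speedup over real time is the duration of the generated audio divided by the wall-clock time spent generating it. The sketch below shows that arithmetic and one way to measure it. The `synthesize` callable and the `sample_rate` parameter are hypothetical stand-ins, not the project's API, and the measured number will depend on the hardware.

```python
import time
from typing import Callable, Sequence


def speedup_over_real_time(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of compute (5.6 means 5.6x real time)."""
    if audio_seconds <= 0 or wall_seconds <= 0:
        raise ValueError("durations must be positive")
    return audio_seconds / wall_seconds


def synthesis_seconds(audio_seconds: float, speedup: float) -> float:
    """Expected compute time for a clip of the given length at a given speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def measure_speedup(
    synthesize: Callable[[str], Sequence[float]],  # hypothetical: text -> audio samples
    text: str,
    sample_rate: int,  # assumed to match the model's output rate
) -> float:
    """Time one synthesis call and return its speedup over real time."""
    start = time.perf_counter()
    samples = synthesize(text)
    wall = time.perf_counter() - start
    return speedup_over_real_time(len(samples) / sample_rate, wall)


# At 5.6x real time, a 10-second clip needs roughly 10 / 5.6 seconds of compute.
print(f"{synthesis_seconds(10.0, 5.6):.2f} s")
```

Timing a single call is noisy, so averaging over several sentences of different lengths gives a more trustworthy number.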
Use Cases
- A developer building an offline accessibility tool for a Raspberry Pi needs a lightweight TTS engine that can run without GPU acceleration and generate speech from user-provided text in real time.
- An indie game creator wants to add dynamic voice narration to their low-resource PC game by cloning a friend's voice for character dialogue without relying on cloud APIs or expensive licensing.
- A researcher prototyping a personalized language learning app on a budget server needs to blend two accent samples to create a unique tutor voice for pronunciation exercises, ensuring all components are open-source.