Real-time local TTS (31M params, 5.6x CPU, voice cloning, ONNX)

Hi guys and gals, I made a TTS model based on my heavily upgraded VITS base, conditioned on external speaker embeddings (Resemble AI's Resemblyzer). The model has ~31M parameters, is tuned for latency and local inference, and comes already exported to ONNX. I was trying to push the limits of what I could do with small, fast models. It runs 5.6x realtime on a server CPU. It supports voice cloning and voice blending (mix two or more speakers to make a new voice). The license is Apache 2.0, and it uses DeepPhonemizer (MIT) for phonemization, so there are no license issues. The repo contains the checkpoint, instructions for running it, and links to Colab and HuggingFace demos.

Now, because it's tiny, audio quality isn't the best, and because it was trained only on LibriTTS-R + VCTK (both fully open datasets), speaker similarity isn't great either. Regardless, I hope it's useful.
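The post does not say how voice blending is implemented. One common way to mix speakers in an embedding-conditioned model is a weighted average of the speaker embeddings, rescaled back to unit length. The sketch below illustrates that idea only: `blend_embeddings` and `l2_normalize` are hypothetical helper names, the 3-dimensional vectors are toy stand-ins for real speaker embeddings, and the assumption that the model expects unit-length embeddings is not confirmed by the source.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def blend_embeddings(embeddings, weights=None):
    """Weighted average of speaker embeddings, renormalized to unit length.

    Assumption: blending is a linear mix in embedding space. The actual
    model may blend voices differently.
    """
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    if weights is None:
        weights = [1.0] * len(embeddings)  # equal mix by default
    if len(weights) != len(embeddings):
        raise ValueError("need exactly one weight per embedding")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    mixed = [
        sum(w * e[i] for w, e in zip(weights, embeddings)) / total
        for i in range(dim)
    ]
    return l2_normalize(mixed)


# Toy 3-dimensional stand-ins for two speakers' embeddings.
speaker_a = l2_normalize([1.0, 0.0, 0.0])
speaker_b = l2_normalize([0.0, 1.0, 0.0])

# 70% speaker A, 30% speaker B.
blended = blend_embeddings([speaker_a, speaker_b], weights=[0.7, 0.3])
```

With real embeddings, the blended vector would be passed to the model in place of a single speaker's embedding; the weights control how strongly each source voice contributes.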

AI Agent
Content Creation
Multilingual

Mar 18, 2026

✨ AI Summary

A lightweight, real-time text-to-speech model optimized for local CPU inference, offering voice cloning and blending capabilities. It prioritizes speed and low resource usage over the highest audio quality and speaker similarity.

Best For

Developers needing fast, local TTS for applications
Hobbyists experimenting with voice synthesis on limited hardware
Projects requiring permissive Apache 2.0 licensing

Why It Matters

It provides a fast, locally-runnable TTS solution with voice manipulation features, balancing capability with a small model size suitable for CPU deployment.

Key Features

Text-to-speech that runs 5.6x faster than real time on a server CPU
Voice cloning using external speaker embeddings from Resemblyzer
Voice blending to mix multiple speakers and create new voices
Local inference optimized for low latency with pre-exported ONNX model
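
To make the "5.6x faster than real time" figure concrete: the speedup is the duration of the generated audio divided by the wall-clock time spent synthesizing it. The sketch below shows the arithmetic only; the function names are hypothetical and the timings are illustrative numbers chosen to reproduce the quoted figure, not measurements from the project.

```python
def realtime_speedup(audio_seconds, synthesis_seconds):
    """Seconds of audio produced per second of compute."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis time must be positive")
    return audio_seconds / synthesis_seconds


def synthesis_time(audio_seconds, speedup):
    """Compute time needed to produce a clip at a given speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# Illustrative: 56 s of audio synthesized in 10 s of wall-clock time.
speedup = realtime_speedup(audio_seconds=56.0, synthesis_seconds=10.0)

# At that rate, a 10-second sentence takes roughly 1.8 s to synthesize.
seconds_needed = synthesis_time(audio_seconds=10.0, speedup=speedup)
```

Any speedup above 1.0 means audio is generated faster than it plays back. The quoted figure is for a server CPU, so slower hardware would need its own measurement.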

Use Cases

A developer building an offline accessibility tool for a Raspberry Pi needs a lightweight TTS engine that can run without GPU acceleration and generate speech from user-provided text in real time.
An indie game creator wants to add dynamic voice narration to their low-resource PC game by cloning a friend's voice for character dialogue without relying on cloud APIs or expensive licensing.
A researcher prototyping a personalized language learning app on a budget server needs to blend two accent samples to create a unique tutor voice for pronunciation exercises, ensuring all components are open-source.
