VoxCPM documentation

A realistic voice synthesis toolkit that brings authentic, expressive voices to your applications, powered by continuous-space diffusion autoregressive modeling.

Get Started

View on GitHub

🌟 Key Features

🌍 30-Language Multilingual - Enter text directly in any of the 30 supported languages; in most cases no explicit language tag is required.
🎨 Voice Design - Create a brand-new voice from a natural-language description alone, with no reference audio required.
🎛️ Controllable Cloning - Clone a voice from a short reference clip, then steer emotion, pace, and style while preserving the original timbre.
🎙️ Ultimate Cloning - For maximum fidelity, provide both the reference audio and its transcript so the model can continue seamlessly from the prompt and preserve more vocal detail.
🔊 48kHz High-Quality Audio - Accepts 16kHz reference audio and outputs 48kHz audio through AudioVAE V2’s asymmetric encode/decode design, with built-in super-resolution and no external upsampler required.
🧠 Context-Aware Synthesis - Automatically infers appropriate prosody and expressiveness from the text itself for more natural, content-matched delivery.
⚡ Real-Time Streaming - Reaches a real-time factor (RTF) as low as 0.13 on an NVIDIA RTX 4090 with NanoVLLM-VoxCPM or vLLM-Omni for high-throughput, concurrent serving.
📦 Fully Open-Source & Commercial-Ready - Weights and code are released under the Apache License 2.0, allowing commercial use.
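
To make the RTF figure in the feature list concrete: real-time factor is commonly defined as wall-clock synthesis time divided by the duration of the generated audio, so values below 1.0 mean audio is produced faster than it plays back. The sketch below is a minimal illustration under that common definition. The function names and the timing values are hypothetical, and this page does not describe the measurement protocol behind the 0.13 figure.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF: wall-clock synthesis time divided by generated audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def audio_seconds_per_wall_second(rtf: float) -> float:
    """Return how many seconds of audio are produced per second of compute."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical timing (not a measured result): 1.3 s of compute for a 10 s clip.
rtf = real_time_factor(1.3, 10.0)
print(f"RTF = {rtf:.2f}")
print(f"Throughput = {audio_seconds_per_wall_second(rtf):.1f}x real time")
```

Under this definition, an RTF of 0.13 corresponds to roughly 7.7 seconds of audio per second of compute for a single stream. Actual figures depend on hardware, batch size, and the serving backend.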

Versions

VoxCPM 2 is the recommended release for new projects. Earlier releases remain available for lighter deployments, compatibility, and historical reference.

VoxCPM 2

Current version
30-Language Multilingual
Voice Design & Style Control
Native 48kHz Audio

Try Now →

Earlier Releases

VoxCPM 1.5 for lighter Chinese/English deployment
VoxCPM 1.0 for baseline and historical reference
Compatibility and migration guidance for 1.x workflows

View Earlier Releases →

Community Projects

We’re excited to see the VoxCPM community growing. Here are a few representative ecosystem projects:

NanoVLLM-VoxCPM for high-throughput GPU serving
vLLM-Omni for official VoxCPM2 serving on the upstream vLLM stack with continuous batching and an OpenAI-compatible API
VoxCPM.cpp for ggml / GGUF based CPU, CUDA, and Vulkan inference
VoxCPMANE for Apple Neural Engine deployment
ComfyUI-VoxCPM for node-based workflows and LoRA training
ComfyUI_RH_VoxCPM for full-featured ComfyUI workflows with multi-speaker dialogue, Voice Design, LoRA hot-swapping, and automatic ASR
MLX-Audio for Apple Silicon MLX-based audio inference, API serving, and web UI
TTS WebUI Extension for browser-based usage

See the sidebar Ecosystem section for full setup guides and more community integrations.

Tip

Have you built something cool with VoxCPM? We’d love to feature it here! Please open an issue or pull request to add your project.

Note

The community projects listed above are not officially maintained by OpenBMB.

Risks and Limitations

General Model Behavior: While VoxCPM has been trained on a large-scale dataset, it may still produce outputs that are unexpected, biased, or contain artifacts.
Potential for Misuse of Voice Cloning: VoxCPM’s powerful zero-shot voice cloning capability can generate highly realistic synthetic speech. This technology could be misused for creating convincing deepfakes for purposes of impersonation, fraud, or spreading disinformation. Users of this model must not use it to create content that infringes upon the rights of individuals. It is strictly forbidden to use VoxCPM for any illegal or unethical purposes. We strongly recommend that any publicly shared content generated with this model be clearly marked as AI-generated.
Current Technical Limitations: Although generally stable, the model may occasionally exhibit instability, especially with very long or expressive inputs. VoxCPM 2 introduces Voice Design and Style Control for more direct attribute control, though results may vary.
Language Coverage: VoxCPM 1.x is trained primarily on Chinese and English data. VoxCPM 2 extends support to 30 languages, though performance may vary across languages depending on training data availability.
Usage Restrictions: This model is released for research and development purposes. Commercial use is allowed, but we do not recommend its use in production or commercial applications without rigorous testing and safety evaluations. Please use VoxCPM responsibly.

Citation

If you find our model helpful, please consider citing our work and starring the repository.

@article{voxcpm2025,
   title        = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
   author       = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
   journal      = {arXiv preprint arXiv:2509.24650},
   year         = {2025},
}
