VoxCPM documentation

VoxCPM

A realistic voice synthesis toolkit that brings authentic, expressive voices to your applications, powered by continuous-space diffusion autoregressive modeling.


🌟 Key Features

  • 🌍 30-Language Multilingual - Input text directly in any of the 30 supported languages; in most cases, no explicit language tag is required.

  • 🎨 Voice Design - Create a brand-new voice from a natural-language description alone, with no reference audio required.

  • 🎛️ Controllable Cloning - Clone a voice from a short reference clip, then steer emotion, pace, and style while preserving the original timbre.

  • 🎙️ Ultimate Cloning - For maximum fidelity, provide both the reference audio and its transcript so the model can continue seamlessly from the prompt and preserve more vocal detail.

  • 🔊 48kHz High-Quality Audio - Accepts 16kHz reference audio and outputs 48kHz audio through AudioVAE V2’s asymmetric encode/decode design, with built-in super-resolution and no external upsampler required.

  • 🧠 Context-Aware Synthesis - Automatically infers appropriate prosody and expressiveness from the text itself for more natural, content-matched delivery.

  • Real-Time Streaming - Reaches a real-time factor (RTF) as low as 0.13 on an NVIDIA RTX 4090 with NanoVLLM-VoxCPM or vLLM-Omni for high-throughput, concurrent serving.

  • 📦 Fully Open-Source & Commercial-Ready - Weights and code are released under the Apache License 2.0, allowing commercial use.
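To put the RTF figure in the Real-Time Streaming item in context: real-time factor is conventionally the time spent generating audio divided by the duration of the audio produced, so values below 1.0 mean faster than real time. The sketch below only illustrates that arithmetic. The helper name `real_time_factor` and the sample timings are hypothetical and are not part of VoxCPM.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF = generation time / duration of the generated audio.

    Hypothetical helper for illustration; it is not a VoxCPM API.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Illustrative numbers: generating 10 s of audio in 1.3 s gives RTF 0.13.
rtf = real_time_factor(synthesis_seconds=1.3, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}")
print(f"Speed relative to real time: {1 / rtf:.1f}x")
```

To measure RTF for your own setup, time a synthesis call with `time.perf_counter()` and divide by the length of the returned audio (number of samples divided by the sample rate).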


Versions

VoxCPM 2 is the recommended release for new projects. Earlier releases remain available for lighter deployments, compatibility, and historical reference.

Earlier Releases
  • VoxCPM 1.5 for lighter Chinese/English deployment

  • VoxCPM 1.0 for baseline and historical reference

  • Compatibility and migration guidance for 1.x workflows


Community Projects

We’re excited to see the VoxCPM community growing. A few representative ecosystem projects:

  • NanoVLLM-VoxCPM for high-throughput GPU serving

  • vLLM-Omni for official VoxCPM2 serving on the upstream vLLM stack with continuous batching and an OpenAI-compatible API

  • VoxCPM.cpp for ggml / GGUF based CPU, CUDA, and Vulkan inference

  • VoxCPMANE for Apple Neural Engine deployment

  • ComfyUI-VoxCPM for node-based workflows and LoRA training

  • ComfyUI_RH_VoxCPM for full-featured ComfyUI workflows with multi-speaker dialogue, Voice Design, LoRA hot-swapping, and automatic ASR

  • MLX-Audio for Apple Silicon MLX-based audio inference, API serving, and web UI

  • TTS WebUI Extension for browser-based usage

See the sidebar Ecosystem section for full setup guides and more community integrations.
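As a rough illustration of what "OpenAI-compatible" can mean for a client, the sketch below builds (but does not send) a text-to-speech request in the shape of OpenAI's `/v1/audio/speech` endpoint using only the Python standard library. Everything deployment-specific here is an assumption: the server address, the model name `voxcpm2`, the voice name `default`, and whether a given vLLM-Omni deployment exposes this exact route. Check the project's own setup guide for the real values.

```python
import json
import urllib.request


def build_speech_request(base_url: str, model: str, text: str, voice: str) -> urllib.request.Request:
    """Build a POST request shaped like OpenAI's /v1/audio/speech call.

    Hypothetical helper; the route and field names follow the OpenAI API,
    and are assumed (not confirmed) to match the target server.
    """
    payload = {
        "model": model,            # assumed model identifier
        "input": text,             # text to synthesize
        "voice": voice,            # assumed voice name
        "response_format": "wav",  # requested audio container
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values; replace with those from your own deployment.
req = build_speech_request("http://localhost:8000", "voxcpm2", "Hello from VoxCPM.", "default")
print(req.get_method(), req.full_url)
```

Against a running server, passing `req` to `urllib.request.urlopen` and reading the response body would return the audio bytes, provided the server implements this route.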

Tip

Have you built something cool with VoxCPM? We’d love to feature it here! Please open an issue or pull request to add your project.

Note

The community projects listed above are not officially maintained by OpenBMB.


Risks and Limitations

  • General Model Behavior: While VoxCPM has been trained on a large-scale dataset, it may still produce outputs that are unexpected, biased, or contain artifacts.

  • Potential for Misuse of Voice Cloning: VoxCPM’s zero-shot voice cloning can generate highly realistic synthetic speech, which could be misused to create convincing deepfakes for impersonation, fraud, or disinformation. Users must not use this model to create content that infringes upon the rights of individuals, and any illegal or unethical use of VoxCPM is strictly forbidden. We strongly recommend clearly marking any publicly shared content generated with this model as AI-generated.

  • Current Technical Limitations: Although generally stable, the model may occasionally exhibit instability, especially with very long or expressive inputs. VoxCPM 2 introduces Voice Design and Style Control for more direct attribute control, though results may vary.

  • Language Coverage: VoxCPM 1.x is trained primarily on Chinese and English data. VoxCPM 2 extends support to 30 languages, though performance may vary across languages depending on training data availability.

  • Usage Restrictions: This model is released for research and development purposes. Commercial use is allowed, but we do not recommend its use in production or commercial applications without rigorous testing and safety evaluations. Please use VoxCPM responsibly.