:orphan:

VoxCPM 1.0
==========


.. image:: https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-OpenBMB-yellow
   :target: https://huggingface.co/openbmb/VoxCPM-0.5B
   :alt: Hugging Face

.. image:: https://img.shields.io/badge/ModelScope-OpenBMB-purple
   :target: https://modelscope.cn/models/OpenBMB/VoxCPM-0.5B
   :alt: ModelScope

.. image:: https://img.shields.io/badge/Audio%20Samples-Page-green
   :target: https://openbmb.github.io/VoxCPM-demopage
   :alt: Audio Samples


* **Release Date:** September 16, 2025
* **Parameter Size:** 600M
* **Sampling Rate:** 16 kHz

.. note::

   VoxCPM 1.0 is a legacy baseline release kept for compatibility and historical reference.

Overview
********

VoxCPM is a tokenizer-free Text-to-Speech (TTS) system. By modeling speech in a continuous space, it avoids the limitations of discrete tokenization and enables two flagship capabilities: context-aware speech generation and zero-shot voice cloning.

Unlike mainstream approaches that convert speech to discrete tokens, VoxCPM uses an end-to-end diffusion autoregressive architecture that generates continuous speech representations directly from text. Built on the MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierarchical language modeling and finite scalar quantization (FSQ) constraints, which improves both expressiveness and generation stability.

Architecture
************

.. figure:: /_static/voxcpm1/voxcpm_model.png
    :width: 100%
    :align: center
    :alt: VoxCPM 1.0 Architecture
    :class: no-scaled-link

Getting Started
***************

For installation, loading, and the generic ``generate()`` workflow, start with :doc:`../quickstart`.

Use this checkpoint when you specifically want the original 0.5B VoxCPM release with 16 kHz output.
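
As a rough illustration of selecting this checkpoint, the sketch below wraps loading and synthesis in one helper. It is not the authoritative workflow (that lives in :doc:`../quickstart`). The helper name ``synthesize_to_wav`` is hypothetical, and the sketch assumes that the ``voxcpm`` package exposes ``VoxCPM.from_pretrained``, that ``generate()`` accepts a ``text`` argument and returns a waveform array, and that the ``soundfile`` package is installed.

.. code-block:: python

   # Values taken from this page: the Hugging Face repository and the output rate.
   MODEL_ID = "openbmb/VoxCPM-0.5B"
   SAMPLE_RATE_HZ = 16_000


   def synthesize_to_wav(text: str, out_path: str = "output.wav") -> str:
       """Generate speech for ``text`` and save it as a 16 kHz WAV file.

       Hypothetical helper; the loader and generate() call are assumptions,
       so check the quickstart for the exact signatures.
       """
       # Imported lazily so this module can be imported without the packages installed.
       import soundfile as sf
       from voxcpm import VoxCPM

       model = VoxCPM.from_pretrained(MODEL_ID)  # assumed loader
       wav = model.generate(text=text)  # assumed to return a 1-D float waveform
       sf.write(out_path, wav, SAMPLE_RATE_HZ)
       return out_path

Calling ``synthesize_to_wav("Hello from VoxCPM.")`` would then download the checkpoint on first use and write ``output.wav``, provided the assumptions above hold.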

Benchmarks
**********

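VoxCPM achieves competitive results on public zero-shot TTS benchmarks. In the tables below, ⬇ marks metrics where lower is better, ⬆ marks metrics where higher is better, and ``/`` marks a value that is not available.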

Seed-TTS-eval Benchmark
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. table::
    :widths: auto
    :align: center

    ================= ========== =========== ============== ============== ============== ============== ================ ================
    Model             Parameters Open-Source test-EN                       test-ZH                       test-Hard                        
    ----------------- ---------- ----------- ----------------------------- ----------------------------- ---------------------------------
        /                                     WER/%⬇         SIM/%⬆         CER/%⬇         SIM/%⬆         CER/%⬇           SIM/%⬆          
    ================= ========== =========== ============== ============== ============== ============== ================ ================
    MegaTTS3          0.5B       ❌           2.79           77.1           1.52           79.0           /                /               
    DiTAR             0.6B       ❌           1.69           73.5           1.02           75.3           /                /               
    CosyVoice3        0.5B       ❌           2.02           71.8           1.16           78.0           6.08             75.8            
    CosyVoice3        1.5B       ❌           2.22           72.0           1.12           78.1           5.83             75.8            
    Seed-TTS          /          ❌           2.25           76.2           1.12           79.6           7.59             77.6            
    MiniMax-Speech    /          ❌           1.65           69.2           0.83           78.3           /                /               
    CosyVoice         0.3B       ✅           4.29           60.9           3.63           72.3           11.75            70.9            
    CosyVoice2        0.5B       ✅           3.09           65.9           1.38           75.7           **6.83**         72.4            
    F5-TTS            0.3B       ✅           2.00           67.0           1.53           76.0           8.67             71.3            
    SparkTTS          0.5B       ✅           3.14           57.3           1.54           66.0           /                /               
    FireRedTTS        0.5B       ✅           3.82           46.0           1.51           63.5           17.45            62.1            
    FireRedTTS-2      1.5B       ✅           1.95           66.5           1.14           73.6           /                /               
    Qwen2.5-Omni      7B         ✅           2.72           63.2           1.70           75.2           7.97             **74.7**        
    OpenAudio-s1-mini 0.5B       ✅           1.94           55.0           1.18           68.5           /                /               
    IndexTTS2         1.5B       ✅           2.23           70.6           1.03           76.5           /                /               
    VibeVoice         1.5B       ✅           3.04           68.9           1.16           74.4           /                /               
    HiggsAudio-v2     3B         ✅           2.44           67.7           1.50           74.0           /                /               
    **VoxCPM**        0.5B       ✅           **1.85**       **72.9**       **0.93**       **77.2**       8.87             73.0            
    ================= ========== =========== ============== ============== ============== ============== ================ ================
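
The WER and CER columns are edit-distance error rates: the number of word-level (WER) or character-level (CER) substitutions, insertions, and deletions needed to turn a transcript of the synthesized audio into the reference text, divided by the reference length. The sketch below shows that calculation for a single utterance. It is a minimal illustration, not the benchmark's scoring script: the function names are hypothetical, and the speech recognition step and text normalization used by the benchmark are not reproduced here.

.. code-block:: python

   def edit_distance(ref, hyp):
       """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
       prev = list(range(len(hyp) + 1))
       for i, r in enumerate(ref, start=1):
           curr = [i]
           for j, h in enumerate(hyp, start=1):
               cost = 0 if r == h else 1
               curr.append(min(
                   prev[j] + 1,         # deletion from the reference
                   curr[j - 1] + 1,     # insertion into the hypothesis
                   prev[j - 1] + cost,  # substitution or match
               ))
           prev = curr
       return prev[-1]


   def error_rate_percent(ref_units, hyp_units):
       """Edit distance divided by reference length, as a percentage."""
       if not ref_units:
           raise ValueError("reference must not be empty")
       return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


   def wer_percent(reference: str, hypothesis: str) -> float:
       """Word error rate: units are whitespace-separated words."""
       return error_rate_percent(reference.split(), hypothesis.split())


   def cer_percent(reference: str, hypothesis: str) -> float:
       """Character error rate: units are characters, ignoring spaces."""
       return error_rate_percent(
           list(reference.replace(" ", "")),
           list(hypothesis.replace(" ", "")),
       )

For example, ``wer_percent("the cat sat on the mat", "the cat sat on mat")`` counts one deleted word out of six reference words. Benchmark figures aggregate such errors over a whole test set, so they are not directly comparable to a single-utterance value.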

CV3-eval Benchmark
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. table::
    :widths: auto
    :align: center

    ================= ======== ======== ======= ====== ======= ======== ====== =======
      Model             zh       en       hard/zh               hard/en                
    ----------------- -------- -------- ---------------------- -----------------------
          /           CER/%⬇   WER/%⬇   CER/%⬇  SIM/%⬆ DNSMOS⬆ WER/%⬇   SIM/%⬆ DNSMOS⬆
    ================= ======== ======== ======= ====== ======= ======== ====== =======
    F5-TTS            5.47     8.90     /       /      /       /        /      /      
    SparkTTS          5.15     11.0     /       /      /       /        /      /      
    GPT-SoVits        7.34     12.5     /       /      /       /        /      /      
    CosyVoice2        4.08     6.32     12.58   72.6   3.81    11.96    66.7   3.95   
    OpenAudio-s1-mini 4.00     5.54     18.1    58.2   3.77    12.4     55.7   3.89   
    IndexTTS2         3.58     4.45     12.8    74.6   3.65    /        /      /      
    HiggsAudio-v2     9.54     7.89     41.0    60.2   3.39    10.3     61.8   3.68   
    CosyVoice3-0.5B   3.89     5.24     14.15   78.6   3.75    9.04     75.9   3.92   
    CosyVoice3-1.5B   3.91     4.99     9.77    78.5   3.79    10.55    76.1   3.95   
    **VoxCPM**        **3.40** **4.04** 12.9    66.1   3.59    **7.89** 64.3   3.74   
    ================= ======== ======== ======= ====== ======= ======== ====== =======
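
The SIM columns report speaker similarity. A common way to compute such a score is the cosine similarity between a speaker embedding of the prompt audio and one of the synthesized audio, scaled to a percentage. The sketch below shows only that final step, under the assumption that the embeddings are already available as equal-length vectors. The function name is hypothetical, and the speaker embedding model used by each benchmark is not specified on this page.

.. code-block:: python

   import math


   def cosine_similarity_percent(a, b):
       """Cosine similarity of two equal-length vectors, scaled to a percentage."""
       if len(a) != len(b):
           raise ValueError("embeddings must have the same dimension")
       dot = sum(x * y for x, y in zip(a, b))
       norm_a = math.sqrt(sum(x * x for x in a))
       norm_b = math.sqrt(sum(y * y for y in b))
       if norm_a == 0.0 or norm_b == 0.0:
           raise ValueError("embeddings must be non-zero")
       return 100.0 * dot / (norm_a * norm_b)

Identical embeddings score 100 and orthogonal embeddings score 0, so a higher SIM means the synthesized voice sits closer to the prompt speaker in the embedding space.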