Version History
===============

This page summarizes all VoxCPM releases, including feature comparison, version highlights, and migration guidance.

Quick Comparison
****************

.. list-table::
   :widths: 28 24 24 24
   :header-rows: 1

   * - Feature
     - VoxCPM 1.0
     - VoxCPM 1.5
     - VoxCPM 2
   * - **Parameters**
     - 640M
     - 800M
     - 2B
   * - **Audio Output**
     - 16kHz
     - 44.1kHz
     - 48kHz
   * - **Languages**
     - 2 (zh, en)
     - 2 (zh, en)
     - 30
   * - **Patch Size**
     - 2
     - 4
     - 4
   * - **LM Token Rate**
     - 12.5Hz
     - 6.25Hz
     - 6.25Hz
   * - **Max Sequence Length**
     - 4096
     - 4096
     - 8192
   * - **Residual LM Fusion**
     - Additive
     - Additive
     - Concat + Projection
   * - **DiT Conditioning**
     - Single token (add)
     - Single token (add)
     - Multi-token (concat)
   * - **Reference Audio**
     - Prompt continuation
     - Prompt continuation
     - Isolated ref channel
   * - **Voice Design**
     - —
     - —
     - ✅
   * - **Style Control**
     - —
     - —
     - ✅
   * - **SFT / LoRA**
     - ✅
     - ✅
     - ✅
   * - **RTF (RTX 4090)**
     - ~0.17
     - ~0.15
     - ~0.3

For a detailed explanation of the architecture components (four-stage pipeline, AudioVAE, Local DiT), see :doc:`./architecture`.

VoxCPM 2
*********

VoxCPM 2 is the latest major release — a 2B parameter model trained on 2.36 million hours of multilingual data. It represents a significant leap in capacity, quality, and controllability over the 1.x series.

Key characteristics:

- 48kHz audio output via AudioVAE V2 (asymmetric 16kHz encode → 48kHz decode)
- 30-language multilingual support
- Voice Design: create a voice from natural-language description, no reference audio needed
- Style Control: control emotion, pace, and speaking style of a cloned voice via text tags
- Isolated reference channel for voice cloning (no matching transcript required)
- Concat-Projection residual LM fusion and multi-token DiT conditioning for richer expressiveness
- Built on a `MiniCPM-4 <https://github.com/OpenBMB/MiniCPM>`_ backbone

Use VoxCPM 2 for all new projects. It is the recommended default for multilingual synthesis, voice cloning, voice design, and production deployment.

VoxCPM 1.5
***********

VoxCPM 1.5 is the final 1.x upgrade before VoxCPM 2. It improves audio quality and efficiency while keeping the core context-aware generation and zero-shot voice cloning workflow familiar to existing 1.x users.

Key characteristics:

- 44.1kHz output
- 6.25Hz LM token rate
- patch size increased from 2 to 4
- simpler migration path for existing VoxCPM 1.0 users

Use VoxCPM 1.5 when you want a lighter Chinese/English checkpoint than VoxCPM 2, while keeping stronger output quality than VoxCPM 1.0.

VoxCPM 1.0
***********

VoxCPM 1.0 is the original tokenizer-free VoxCPM release. It remains useful as the baseline reference point for the family and for older experiments built around the original 0.5B checkpoint.

Key characteristics:

- 600M parameter size
- 16kHz output
- original VoxCPM architecture release
- benchmark reference for early VoxCPM results

Use VoxCPM 1.0 when you need the smallest historical checkpoint or want to compare against the original baseline behavior.

Migration Guidance
******************

- **New projects** should start with VoxCPM 2.
- **Existing VoxCPM 1.0 users** should generally move to VoxCPM 1.5 first if they need a lower-risk 1.x upgrade path.
- If you need multilingual synthesis, Voice Design, Style Control, or 48kHz output, move directly to VoxCPM 2.

Detailed Pages
**************

- Full VoxCPM 2 page: :doc:`./voxcpm2`
- Full VoxCPM 1.5 page: :doc:`./voxcpm1.5`
- Full VoxCPM 1.0 page: :doc:`./voxcpm1`