Changelog
=========

This page documents **developer-visible** changes across VoxCPM releases — new
APIs, architecture flags, CLI commands, config fields, training script updates,
and dependency changes. If you are migrating between versions, read from your
current version forward.

----

VoxCPM 2.0 — March 2026
************************

.. important::

   **Breaking changes at a glance** — read before upgrading from 1.x:

   1. ``VoxCPM.from_pretrained()`` now defaults to ``openbmb/VoxCPM2``.
      If you rely on the default, your code will load the 2B model instead
      of VoxCPM 1.5. Pin explicitly if needed:
      ``VoxCPM.from_pretrained("openbmb/VoxCPM1.5")``.

   2. **Output sample rate changed**: 44.1 kHz (1.5) → **48 kHz** (2.0).
      Any code that hard-codes ``sf.write(..., 44100)`` must switch to
      ``model.tts_model.sample_rate`` (which returns ``48000`` for V2).

   3. **Gradio** dependency bumped to ``>=6,<7``.
      Gradio 5 apps will not install alongside VoxCPM 2. ``app_old.py``
      (the 1.5 demo) has been adapted to Gradio 6 as well.

   4. **CLI subcommand design**: the old flat ``voxcpm --text ...`` still
      works but prints a deprecation warning. Prefer
      ``voxcpm design|clone|batch``.
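
The effect of a stale hard-coded rate (item 2) can be checked with simple
arithmetic. The sketch below is illustrative only: ``playback_duration`` is a
hypothetical helper, not part of VoxCPM, and the 10-second clip length is an
arbitrary assumption.

.. code-block:: python

   def playback_duration(num_samples: int, header_rate: int) -> float:
       """Seconds a player reports for `num_samples` stored with `header_rate`."""
       return num_samples / header_rate


   V2_OUTPUT_RATE = 48_000  # VoxCPM 2 output rate stated above
   STALE_RATE = 44_100      # rate that 1.5-era code may have hard-coded

   # Assumption: a 10-second clip generated by VoxCPM 2.
   num_samples = V2_OUTPUT_RATE * 10

   correct = playback_duration(num_samples, V2_OUTPUT_RATE)
   stale = playback_duration(num_samples, STALE_RATE)

   # Storing 48 kHz samples with a 44.1 kHz header stretches the clip
   # and lowers its pitch by the same ratio.
   slowdown = stale / correct
   print(f"correct: {correct:.2f} s, stale header: {stale:.2f} s, x{slowdown:.3f}")

Under these assumptions the mislabeled file plays roughly 8.8% longer than
intended, which is why reading the rate from the model is the safer habit.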


30-Language Multilingual
^^^^^^^^^^^^^^^^^^^^^^^^^

VoxCPM 2 extends language support from **2 (Chinese, English)** to **30
languages** across 8 language families, trained on 2.36 million hours of data
(1.8M zh+en base + 560K multilingual). The full language list is documented in
:doc:`../models/voxcpm2`.

At the code level, multilingual synthesis requires **no API changes** — simply
pass text in any supported language. Language detection is handled internally by
the model.


Model & Architecture
^^^^^^^^^^^^^^^^^^^^

- **New model class** ``VoxCPM2Model`` (``model/voxcpm2.py``).
  Existing ``VoxCPMModel`` (``model/voxcpm.py``) is unchanged and continues to
  serve 1.0 / 1.5 checkpoints.

- **Architecture auto-detection** via ``config.json`` → ``"architecture"`` field:

  - ``"voxcpm2"`` → ``VoxCPM2Model``
  - ``"voxcpm"`` (or key absent) → ``VoxCPMModel``

- **Parameter count**: 2B (up from 750M in 1.5).

- **Residual LM fusion**: additive → concat-projection.
  New ``fusion_concat_proj`` linear layer (``Linear(2h → h)``).

  .. code-block:: text

     # 1.x
     residual_input = lm_output + masked_audio_embed

     # 2.0
     residual_input = fusion_concat_proj(cat(lm_output, masked_audio_embed))

- **DiT conditioning**: single-token add → multi-token concat.
  ``VoxCPMLocDiTV2`` (``modules/locdit/local_dit_v2.py``) reshapes the concatenated
  LM + residual projections into multiple prefix tokens.

  .. code-block:: text

     # 1.x DiT input
     [ (mu + t) | cond | x ]       ← 1 conditioning token

     # 2.0 DiT input
     [ mu₁ | mu₂ | t | cond | x ]  ← 2 conditioning tokens + timestep

- **Isolated reference audio channel** with special tokens
  ``ref_audio_start_token = 103``, ``ref_audio_end_token = 104``.
  Enables four generation modes: zero-shot, continuation, reference-only,
  combined (ref + continuation).

- **Config defaults changed**:

  .. list-table::
     :widths: 40 20 20
     :header-rows: 1

     * - Field
       - 1.x
       - 2.0
     * - ``patch_size``
       - 2 (1.0) / 4 (1.5)
       - 4
     * - ``residual_lm_num_layers``
       - 6
       - 8
     * - ``scalar_quantization_latent_dim``
       - 256
       - 512
     * - ``max_length``
       - 4096
       - 8192

- **New config field**: ``residual_lm_no_rope`` (bool, default ``False``).

- ``dit_mean_mode`` **moved** from root ``VoxCPMConfig`` (1.x) into nested
  ``VoxCPMDitConfig`` (2.0).
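
The architecture auto-detection rule listed above can be sketched as a small
lookup. This is a minimal illustration, not the loader that ships with VoxCPM:
``detect_model_class`` and ``ARCHITECTURE_TO_CLASS`` are hypothetical names, the
function returns the class name as a string so the snippet runs without VoxCPM
installed, and raising ``ValueError`` for unknown values is an assumption.

.. code-block:: python

   import json
   from pathlib import Path

   # Mapping taken from the list above; the values are class names only.
   ARCHITECTURE_TO_CLASS = {
       "voxcpm2": "VoxCPM2Model",
       "voxcpm": "VoxCPMModel",
   }


   def detect_model_class(model_dir: str) -> str:
       """Return the model class name implied by `model_dir/config.json`."""
       config_path = Path(model_dir) / "config.json"
       config = json.loads(config_path.read_text(encoding="utf-8"))
       # A missing "architecture" key falls back to the 1.x code path.
       architecture = config.get("architecture", "voxcpm")
       if architecture not in ARCHITECTURE_TO_CLASS:
           # Assumption: unknown values are rejected rather than guessed.
           raise ValueError(f"Unknown architecture: {architecture!r}")
       return ARCHITECTURE_TO_CLASS[architecture]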


AudioVAE V2
^^^^^^^^^^^^

- **New module** ``AudioVAEV2`` (``modules/audiovae/audio_vae_v2.py``).

  .. list-table::
     :widths: 30 35 35
     :header-rows: 1

     * - Attribute
       - AudioVAE (v1)
       - AudioVAEV2
     * - ``decoder_dim``
       - 1536
       - 2048
     * - ``decoder_rates``
       - ``[8, 8, 5, 2]``
       - ``[8, 6, 5, 2, 2, 2]``
     * - Output sample rate
       - ``sample_rate`` (16 kHz in 1.0 / 44.1 kHz in 1.5)
       - ``out_sample_rate`` (48 kHz)
     * - Sample-rate conditioning
       - No
       - Yes (``SampleRateConditionLayer``)

- **Asymmetric encode/decode**: the encoder runs at 16 kHz with 640× downsampling
  (25 latent frames per second, i.e. a 6.25 Hz token rate at ``patch_size`` 4);
  the decoder upsamples 1920× to 48 kHz.
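
The rates quoted above are mutually consistent, which the following arithmetic
sketch checks. It assumes that the 6.25 Hz token rate is the latent frame rate
divided by ``patch_size``; that relationship is inferred from the numbers in
this changelog rather than taken from the VoxCPM source, and the constant names
are illustrative.

.. code-block:: python

   from math import prod

   ENCODER_SAMPLE_RATE = 16_000        # Hz, encoder input
   ENCODER_DOWNSAMPLE = 640            # samples per latent frame
   DECODER_RATES = [8, 6, 5, 2, 2, 2]  # AudioVAEV2 decoder_rates from the table
   PATCH_SIZE = 4                      # 2.0 default from the config table

   # Latent frames produced per second of 16 kHz input.
   latent_frame_rate = ENCODER_SAMPLE_RATE / ENCODER_DOWNSAMPLE

   # Assumption: one LM token covers `PATCH_SIZE` latent frames.
   token_rate = latent_frame_rate / PATCH_SIZE

   # Total decoder upsampling is the product of the per-stage rates.
   decoder_upsample = prod(DECODER_RATES)

   # Output samples per second reconstructed from the latent frames.
   output_sample_rate = latent_frame_rate * decoder_upsample

   print(latent_frame_rate, token_rate, decoder_upsample, output_sample_rate)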


Python API (``core.py``)
^^^^^^^^^^^^^^^^^^^^^^^^^

- **Default Hub model** changed from ``openbmb/VoxCPM1.5`` to ``openbmb/VoxCPM2``
  in ``VoxCPM.from_pretrained()``.

- ``generate()`` **parameter comparison** (1.x vs 2.0):

  .. list-table::
     :widths: 30 15 15 40
     :header-rows: 1

     * - Parameter
       - 1.x
       - 2.0
       - Notes
     * - ``text``
       - yes
       - yes
       - In 2.0, prepend ``(instruction)`` for Voice Design / Style Control
     * - ``prompt_wav_path``
       - yes
       - yes
       - Continuation mode cloning (same as 1.x)
     * - ``prompt_text``
       - yes
       - yes
       - Must pair with ``prompt_wav_path``
     * - ``reference_wav_path``
       - **no**
       - **new**
       - Isolated voice cloning. Raises ``ValueError`` on 1.x models
     * - ``cfg_value``
       - yes
       - yes
       -
     * - ``inference_timesteps``
       - yes
       - yes
       -
     * - ``normalize``
       - yes
       - yes
       -
     * - ``denoise``
       - yes
       - yes
       - 2.0: also denoises ``reference_wav_path``
     * - ``streaming``
       - yes
       - yes
       -

- **Four generation modes** (V2 only) — determined by which audio arguments
  you pass:

  .. list-table::
     :widths: 25 20 20 35
     :header-rows: 1

     * - Mode
       - ``prompt_wav_path``
       - ``reference_wav_path``
       - Use case
     * - Zero-shot
       - ``None``
       - ``None``
       - Text-only synthesis (or Voice Design with ``(instruction)`` prefix)
     * - Continuation
       - set
       - ``None``
       - Seamless continuation from prompt audio (same as 1.x)
     * - Reference-only
       - ``None``
       - set
       - Isolated voice cloning from a reference clip
     * - Combined
       - set
       - set
       - Reference for timbre + prompt for context (best cloning similarity)

  .. code-block:: python

     # Reference-only cloning (V2 only)
     wav = model.generate(
         text="Hello world.",
         reference_wav_path="speaker.wav",
     )

     # Voice Design (V2 only) — describe a voice in parentheses
     wav = model.generate(
         text="(Warm female voice, mid-30s, calm tone) Welcome to VoxCPM 2.",
     )

     # Style Control (V2 only) — reference for timbre, instruction for style
     wav = model.generate(
         text="(Whispering, mysterious) The secret lies in the ancient library.",
         reference_wav_path="speaker.wav",
     )

- ``sample_rate`` **property** on the inner model now returns the **output**
  rate: V1 uses ``audio_vae.sample_rate`` (16 kHz / 44.1 kHz), V2 uses
  ``audio_vae.out_sample_rate`` (**48 kHz**). Always use
  ``model.tts_model.sample_rate`` when saving audio:

  .. code-block:: python

     sf.write("output.wav", wav, model.tts_model.sample_rate)

- **Reference audio VAD trim** (V2 only): ``_trim_audio_silence_vad`` in
  ``VoxCPM2Model`` automatically trims trailing silence from reference audio
  using librosa-based energy detection.

- ``build_prompt_cache`` **dispatch**:

  - V2: accepts ``prompt_text``, ``prompt_wav_path``, ``reference_wav_path``.
    Returns a dict with ``mode`` (``"reference"`` / ``"continuation"`` /
    ``"ref_continuation"``).
  - V1: accepts ``prompt_text``, ``prompt_wav_path`` only. No ``mode`` key.

- **LoRA interface** on ``VoxCPM``: ``load_lora()``, ``unload_lora()``,
  ``set_lora_enabled()``, ``get_lora_state_dict()``, ``lora_enabled`` property.

- **Denoiser**: supports denoising both ``prompt_wav_path`` and
  ``reference_wav_path`` when ``denoise=True``.
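
The way the audio arguments select a generation mode can be sketched as a pure
function. This is an illustration of the tables above, not VoxCPM's own
dispatch code: ``resolve_generation_mode`` and its ``is_v2`` flag are
hypothetical, the ``"zero_shot"`` label is invented here (the changelog only
names the three ``build_prompt_cache`` modes), and rejecting ``prompt_text``
without ``prompt_wav_path`` is an assumption.

.. code-block:: python

   from typing import Optional


   def resolve_generation_mode(
       prompt_wav_path: Optional[str] = None,
       prompt_text: Optional[str] = None,
       reference_wav_path: Optional[str] = None,
       is_v2: bool = True,
   ) -> str:
       """Map the audio arguments of a generate() call to a mode label."""
       # The prompt audio and its transcript must be given together.
       if (prompt_wav_path is None) != (prompt_text is None):
           raise ValueError("prompt_wav_path and prompt_text must be paired")
       # The isolated reference channel exists only in VoxCPM 2.
       if reference_wav_path is not None and not is_v2:
           raise ValueError("reference_wav_path requires a VoxCPM 2 model")

       has_prompt = prompt_wav_path is not None
       has_reference = reference_wav_path is not None
       if has_prompt and has_reference:
           return "ref_continuation"  # combined mode
       if has_reference:
           return "reference"         # reference-only cloning
       if has_prompt:
           return "continuation"      # same behaviour as 1.x
       return "zero_shot"             # hypothetical label for text-only synthesis

For example, ``resolve_generation_mode(reference_wav_path="speaker.wav")``
yields ``"reference"``, matching the reference-only row of the mode table.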


CLI (``cli.py``)
^^^^^^^^^^^^^^^^^

- **Complete rewrite** — VoxCPM2-first subcommand design.

- **New subcommands**:

  - ``voxcpm design`` — text-to-speech with optional ``--control`` instruction.
  - ``voxcpm clone`` — voice cloning via ``--reference-audio`` and/or
    ``--prompt-audio`` + ``--prompt-text``.
  - ``voxcpm batch`` — batch processing from a text file.

- **New flags**:

  .. list-table::
     :widths: 35 65
     :header-rows: 1

     * - Flag
       - Description
     * - ``--control``
       - Voice design / style control instruction (prepended as ``(control)text``)
     * - ``--reference-audio`` / ``-ra``
       - Reference audio for isolated voice cloning (VoxCPM2 only)
     * - ``--prompt-file``
       - Load prompt text from a file
     * - ``--denoise``
       - Enhance prompt/reference audio
     * - ``--no-optimize``
       - Disable ``torch.compile``
     * - ``--no-denoiser``
       - Skip denoiser loading
     * - ``--zipenhancer-path``
       - Custom denoiser model path
     * - ``--lora-path``
       - Inference-time LoRA weights
     * - ``--lora-r`` / ``--lora-alpha`` / ``--lora-dropout``
       - LoRA config overrides at inference
     * - ``--lora-disable-lm``
       - Disable LoRA on LM layers
     * - ``--lora-disable-dit``
       - Disable LoRA on DiT layers
     * - ``--lora-enable-proj``
       - Enable LoRA on projection layers

- **Default HF model**: ``openbmb/VoxCPM2`` (constant ``DEFAULT_HF_MODEL_ID``).

- **Architecture detection** (``detect_model_architecture``): reads local
  ``config.json`` or infers from HF id string. ``--reference-audio`` is
  rejected on 1.x models.

- **Legacy root arguments** (``voxcpm --text ...``) still work but print a
  deprecation warning via ``warn_legacy_mode()``.

- **Control instruction wiring**: ``build_final_text(text, control)`` produces
  ``"(control)text"`` — the convention used by VoxCPM 2 for Voice Design and
  Style Control.
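
The subcommand layout described above could be wired with ``argparse`` roughly
as follows. This is a hedged sketch, not the contents of ``cli.py``: only flags
named in this changelog are declared, which options are required is an
assumption, and the arguments of ``batch`` are omitted because they are not
listed here.

.. code-block:: python

   import argparse


   def build_parser() -> argparse.ArgumentParser:
       """Build a minimal parser with design / clone / batch subcommands."""
       parser = argparse.ArgumentParser(prog="voxcpm")
       subparsers = parser.add_subparsers(dest="command", required=True)

       # Text-to-speech with an optional control instruction.
       design = subparsers.add_parser("design")
       design.add_argument("--text", required=True)
       design.add_argument("--control", default=None)
       design.add_argument("--output", required=True)

       # Voice cloning from reference and/or prompt audio.
       clone = subparsers.add_parser("clone")
       clone.add_argument("--text", required=True)
       clone.add_argument("--control", default=None)
       clone.add_argument("--reference-audio", "-ra", default=None)
       clone.add_argument("--prompt-audio", default=None)
       clone.add_argument("--prompt-text", default=None)
       clone.add_argument("--output", required=True)

       # Batch processing; its arguments are not covered by this sketch.
       subparsers.add_parser("batch")
       return parser


   args = build_parser().parse_args(
       ["clone", "--text", "Hello.", "-ra", "ref.wav", "--output", "out.wav"]
   )
   print(args.command, args.reference_audio)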


Controllable Generation
^^^^^^^^^^^^^^^^^^^^^^^^

Both features use the same convention: place a natural-language instruction
inside parentheses ``()`` before the target text.
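
As a minimal sketch of this convention, the helper below prepends an optional
instruction to the target text. ``with_control`` is a hypothetical name; it
follows the ``"(control)text"`` form described in the CLI section, and
returning the text unchanged for an empty instruction is an assumption.

.. code-block:: python

   from typing import Optional


   def with_control(text: str, control: Optional[str] = None) -> str:
       """Return `text` prefixed with `(control)` when an instruction is given."""
       instruction = (control or "").strip()
       if not instruction:
           # No instruction: plain synthesis text.
           return text
       return f"({instruction}){text}"


   final_text = with_control(
       "Welcome to VoxCPM 2.",
       "Young female voice, warm and gentle",
   )
   print(final_text)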

- **Voice Design**: generate speech from a natural-language description without
  reference audio. Use ``(description)`` prefix in text, or ``--control`` in
  CLI.

  .. code-block:: bash

     # CLI
     voxcpm design \
       --text "Welcome to VoxCPM 2." \
       --control "Young female voice, warm and gentle" \
       --output out.wav

  .. code-block:: python

     # Python — the control instruction is part of the text string
     wav = model.generate(
         text="(Young female voice, warm and gentle) Welcome to VoxCPM 2.",
     )

- **Style Control**: control speaking style while using reference audio for
  timbre. The reference determines **who** speaks; the instruction controls
  **how** they speak.

  .. code-block:: bash

     voxcpm clone \
       --text "The secret lies hidden in the ancient library." \
       --control "Speaking slowly with a whispering tone" \
       --reference-audio ref.wav \
       --output out.wav


Training Script
^^^^^^^^^^^^^^^^

- ``scripts/train_voxcpm_finetune.py`` now **auto-detects** ``VoxCPMModel`` vs
  ``VoxCPM2Model`` from the pretrained checkpoint's ``config.json``. No
  separate VoxCPM2 training script needed.

- **New training parameter** ``grad_accum_steps``: gradient accumulation for a
  larger effective batch size without extra VRAM.

- **Validation improvements**: logs generated sample audio and mel spectrograms
  to TensorBoard at each ``valid_interval``.

- **Signal handler**: catches ``SIGTERM`` / ``SIGINT`` and saves a checkpoint
  before exiting.

- **DDP**: manual epoch management with ``DistributedSampler.set_epoch()`` for
  correct shuffling across epochs.

- **Checkpoint format**:

  - LoRA: ``lora_weights.safetensors`` (or ``.ckpt``), ``lora_config.json``
    with ``base_model`` and ``lora_config`` fields.
  - Full SFT: ``model.safetensors`` (or ``pytorch_model.bin``), plus copied
    ``config.json``, ``audiovae.pth``/``audiovae.safetensors``, and tokenizer
    files.
  - Both: ``optimizer.pth``, ``scheduler.pth``, ``training_state.json``.
  - ``latest/`` folder updated on every save for easy resume.

- **LoRA distribution flag**: ``distribute: true`` + ``hf_model_id`` in YAML
  saves the HF id (instead of a local path) as ``base_model`` in
  ``lora_config.json`` for easier sharing.
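
The signal-handling behaviour noted above (checkpoint on ``SIGTERM`` /
``SIGINT``, then exit) can be approximated with a flag-based handler. This is a
sketch under stated assumptions, not the logic in
``scripts/train_voxcpm_finetune.py``: ``StopRequest`` and ``run_training`` are
hypothetical names, and whether the real script saves inside the handler or
from the loop is not specified in this changelog.

.. code-block:: python

   import signal


   class StopRequest:
       """Remembers that SIGTERM or SIGINT arrived."""

       def __init__(self) -> None:
           self.requested = False

       def install(self) -> None:
           # signal.signal() may only be called from the main thread.
           for signum in (signal.SIGTERM, signal.SIGINT):
               signal.signal(signum, self._handle)

       def _handle(self, signum, frame) -> None:
           # Only set a flag; the training loop does the checkpointing.
           self.requested = True


   def run_training(num_steps, train_step, save_checkpoint, stop) -> int:
       """Run up to `num_steps` steps and checkpoint early if a stop was requested."""
       for step in range(num_steps):
           train_step(step)
           if stop.requested:
               save_checkpoint(step)
               return step
       return num_steps

Keeping the handler to a single assignment avoids doing file I/O inside signal
context; the loop saves at the next step boundary instead.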


LoRA
^^^^^

- **V2 LoRA target modules**: ``target_proj_modules`` now includes
  ``fusion_concat_proj`` (in addition to ``enc_to_lm_proj``,
  ``lm_to_dit_proj``, ``res_to_dit_proj``).

- ``LoRALinear`` stores ``scaling`` as a non-persistent buffer to avoid
  ``torch.compile`` recompilation when toggling LoRA.


``torch.compile`` Optimization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- **Both** ``VoxCPMModel`` and ``VoxCPM2Model`` support an ``optimize()`` method:

  - Compiles ``base_lm.forward_step``, ``residual_lm.forward_step``,
    ``feat_encoder``, ``feat_decoder.estimator`` with
    ``mode="reduce-overhead"``, ``fullgraph=True``.
  - Requires CUDA + Triton; gracefully skips on other backends.

- ``optimize=True`` by default in ``VoxCPM.__init__`` / ``from_pretrained``.
  Use ``--no-optimize`` in CLI to disable.

- A warm-up call runs after model load to trigger the initial compilation.
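
A caller-side guard for the CUDA + Triton requirement might look like the
sketch below. It is an assumption-laden illustration rather than the check
VoxCPM performs internally: ``compile_supported`` and ``maybe_optimize`` are
hypothetical helpers, and the model is assumed to expose an ``optimize()``
method that takes no arguments.

.. code-block:: python

   import importlib.util


   def compile_supported() -> bool:
       """Return True only if torch, Triton, and a CUDA device look available."""
       if importlib.util.find_spec("torch") is None:
           return False
       if importlib.util.find_spec("triton") is None:
           return False
       import torch  # imported lazily so the check works without torch installed

       return torch.cuda.is_available()


   def maybe_optimize(model, enabled: bool = True) -> bool:
       """Call `model.optimize()` when enabled and supported; report whether it ran."""
       if enabled and compile_supported() and hasattr(model, "optimize"):
           model.optimize()
           return True
       # Unsupported backend or disabled: skip compilation and run eagerly.
       return False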


Dependencies (``pyproject.toml``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- ``gradio>=6,<7`` (was ``gradio<6``).
- ``torch>=2.5.0``, ``torchaudio>=2.5.0`` (minimum bumped).
- Added ``torchcodec``, ``safetensors``, ``argbind``.
- Removed ``mypy`` from dev dependencies.
- Package version now managed by ``setuptools_scm`` (git-tag-based, no
  hard-coded ``__version__``).
- Entry point: ``voxcpm = "voxcpm.cli:main"``.


Demo App (``app.py``)
^^^^^^^^^^^^^^^^^^^^^^

- Full rewrite targeting VoxCPM 2:

  - Default model ``openbmb/VoxCPM2``.
  - Voice Design + Style Control via ``control_instruction`` field.
  - Reference audio + optional continuation (Hi-Fi) path.
  - i18n (English / 中文) support.
  - Gradio 6 patterns (theme/css passed to ``launch()``).

- Original 1.5 demo preserved as ``app_old.py``.


----

VoxCPM 1.5.0 — December 5, 2025
*********************************

Model & Architecture
^^^^^^^^^^^^^^^^^^^^

- **AudioVAE sampling rate**: 16 kHz → **44.1 kHz**. Preserves more
  high-frequency detail for voice cloning.

- **LM token rate**: 12.5 Hz → **6.25 Hz** (halved). Reduces computational
  cost per second of audio.

- **Patch size**: 2 → **4** (LocEnc & LocDiT). The local encoder and decoder now
  process longer patches and use deeper local modules, which slightly increases
  the total parameter count (**~750M**).

- **Real-time factor (RTF)**: ~0.15 on RTX 4090 (comparable to 1.0 despite the
  larger model).

- **config.json** ``architecture`` value: ``"voxcpm"`` (same code path as
  1.0).

- Same model class ``VoxCPMModel`` as 1.0.
  All 1.0 → 1.5 differences are in checkpoint weights and config values,
  **not** in a separate Python class.


Fine-tuning
^^^^^^^^^^^^

- **SFT** (full fine-tuning) and **LoRA** fine-tuning officially supported.

- **Training script** ``scripts/train_voxcpm_finetune.py`` added.

- **Training configs** ``conf/voxcpm_v1.5/``:

  - ``voxcpm_finetune_all.yaml`` — full SFT, ``sample_rate: 44100``,
    ``learning_rate: 1e-5``.
  - ``voxcpm_finetune_lora.yaml`` — LoRA, ``sample_rate: 44100``,
    ``learning_rate: 1e-4``, ``r: 8``, ``alpha: 16``.

- **LoRA WebUI** ``lora_ft_webui.py`` added for browser-based LoRA
  training / inference.

- **Inference test scripts**:

  - ``scripts/test_voxcpm_ft_infer.py`` — full fine-tune checkpoint inference.
  - ``scripts/test_voxcpm_lora_infer.py`` — LoRA checkpoint inference with
    hot-swap demo (``load_lora`` / ``unload_lora`` / ``set_lora_enabled``).


Python API
^^^^^^^^^^^

- ``VoxCPM.from_pretrained()`` default: ``openbmb/VoxCPM1.5``.

- **Streaming** ``generate_streaming()`` API added (returns a generator of
  audio chunks).


Stability Improvements
^^^^^^^^^^^^^^^^^^^^^^^

- Reduced beginning/ending audio artifacts through improved inference logic and
  training data cleaning.

- Lower token rate (6.25 Hz) improves stability on longer speech.


----

VoxCPM 1.0.0 — September 16, 2025
************************************

Initial public release of VoxCPM.

Model
^^^^^^

- **Parameter size**: 600M (VoxCPM-0.5B).
- **Sampling rate**: 16 kHz (AudioVAE v1).
- **LM token rate**: 12.5 Hz, patch size 2.
- **Languages**: Chinese and English.

Python API
^^^^^^^^^^^

- ``VoxCPM`` class with ``from_pretrained()`` / ``generate()`` interface.
- HF model id: ``openbmb/VoxCPM-0.5B``.
- Voice cloning via ``prompt_wav_path`` + ``prompt_text`` (continuation mode
  only).

Training Configs
^^^^^^^^^^^^^^^^^

- ``conf/voxcpm_v1/voxcpm_finetune_all.yaml`` — ``sample_rate: 16000``.
- ``conf/voxcpm_v1/voxcpm_finetune_lora.yaml`` — ``sample_rate: 16000``.

PyPI
^^^^^

- Package ``voxcpm`` published with tags ``1.0.0rc1`` through ``1.0.5``.


----

Version Tags
*************

Versions are managed by ``setuptools_scm`` from git tags. There is no
hard-coded ``__version__`` in the source.
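
Because there is no ``__version__`` attribute to import, the installed version
has to come from package metadata. The following sketch uses the standard
library for that; ``installed_version`` is a hypothetical helper, and the
distribution name ``voxcpm`` is taken from the PyPI note above.

.. code-block:: python

   from importlib.metadata import PackageNotFoundError, version
   from typing import Optional


   def installed_version(distribution: str = "voxcpm") -> Optional[str]:
       """Return the installed version string, or None if it is not installed."""
       try:
           return version(distribution)
       except PackageNotFoundError:
           # Not installed in this environment (for example a raw source checkout).
           return None


   print(installed_version())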

.. list-table::
   :widths: 20 20 60
   :header-rows: 1

   * - Tag
     - Date
     - Notes
   * - ``1.0.0rc1`` – ``1.0.0rc3``
     - 2025-09-16
     - Release candidates
   * - ``1.0.0``
     - 2025-09-16
     - Initial release
   * - ``1.0.1``
     - 2025-09-16
     - Patch
   * - ``1.0.2``
     - 2025-09-17
     - Patch
   * - ``1.0.3``
     - 2025-09-18
     - Patch
   * - ``1.0.4``
     - 2025-09-22
     - Patch
   * - ``1.0.5``
     - 2025-09-30
     - Patch (technical report release)
   * - ``1.5.0``
     - 2025-12-05
     - VoxCPM 1.5