API Reference

Python API

class VoxCPM(voxcpm_model_path, zipenhancer_model_path='iic/speech_zipenhancer_ans_multiloss_16k_base', enable_denoiser=True, optimize=True, device=None, lora_config=None, lora_weights_path=None)

Initialize VoxCPM from a local model directory.

The model architecture (voxcpm or voxcpm2) is auto-detected from the architecture field in config.json.

Parameters:
  • voxcpm_model_path (str) – Local path to the model directory containing weights, configs, and tokenizer files.

  • zipenhancer_model_path (str|None) – ModelScope denoiser model id or local path. Set to None to skip the denoiser entirely.

  • enable_denoiser (bool) – Whether to initialize the ZipEnhancer denoiser pipeline.

  • optimize (bool) – Enable torch.compile acceleration. Disable for debugging or unsupported platforms. This optimization is primarily useful on CUDA.

  • device (str|None) – Runtime device. Use None or "auto" for automatic fallback (cuda -> mps -> cpu), or an explicit value such as "cpu", "mps", "cuda", or "cuda:0". An explicit device that is unavailable raises an error instead of silently falling back.

  • lora_config (LoRAConfig|None) – LoRA configuration. If lora_weights_path is provided without this, a default config (enable_lm=True, enable_dit=True) is created automatically.

  • lora_weights_path (str|None) – Path to pre-trained LoRA weights (.pth file or directory containing lora_weights.ckpt).

model = VoxCPM(
    voxcpm_model_path="/path/to/VoxCPM2",
    enable_denoiser=False,
    device="auto",
)
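
The device rule described above can be sketched as a small pure-Python function. This is an illustration of the documented behavior, not VoxCPM's actual code: resolve_device is a hypothetical name, the two boolean flags stand in for checks such as torch.cuda.is_available() and torch.backends.mps.is_available(), and the exception types are assumptions because the reference only says that an error is raised.

```python
from typing import Optional


def resolve_device(requested: Optional[str], cuda_available: bool, mps_available: bool) -> str:
    """Illustrative device resolution (hypothetical helper, not VoxCPM's code)."""
    if requested is None or requested == "auto":
        # Automatic mode: prefer cuda, then mps, then cpu.
        if cuda_available:
            return "cuda"
        if mps_available:
            return "mps"
        return "cpu"
    # Explicit requests are validated and never silently downgraded.
    if requested == "cpu":
        return "cpu"
    if requested == "mps":
        if not mps_available:
            raise RuntimeError("mps was requested but is not available")
        return "mps"
    if requested == "cuda" or requested.startswith("cuda:"):
        if not cuda_available:
            raise RuntimeError(f"{requested} was requested but CUDA is not available")
        return requested
    raise ValueError(f"unsupported device: {requested!r}")
```

With this sketch, resolve_device("auto", False, True) picks "mps", while resolve_device("cuda", False, True) raises instead of quietly using another device.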

classmethod VoxCPM.from_pretrained(hf_model_id='openbmb/VoxCPM2', load_denoiser=True, zipenhancer_model_id='iic/speech_zipenhancer_ans_multiloss_16k_base', cache_dir=None, local_files_only=False, optimize=True, device=None, lora_config=None, lora_weights_path=None, **kwargs)

Instantiate VoxCPM from a Hugging Face Hub snapshot. Downloads model weights automatically on first use.

Parameters:
  • hf_model_id (str) – Hugging Face repo id (e.g. "openbmb/VoxCPM2") or local directory path.

  • load_denoiser (bool) – Whether to initialize the denoiser pipeline.

  • zipenhancer_model_id (str) – Denoiser model id or local path. Ignored when load_denoiser=False.

  • cache_dir (str|None) – Custom cache directory for the snapshot download.

  • local_files_only (bool) – If True, only use local files and do not attempt to download.

  • optimize (bool) – Enable torch.compile acceleration. This is primarily a CUDA optimization.

  • device (str|None) – Runtime device. None / "auto" uses automatic fallback. Explicit choices such as "cpu", "mps", "cuda", and "cuda:0" are validated and do not auto-fallback.

  • lora_config (LoRAConfig|None) – LoRA configuration for fine-tuned models.

  • lora_weights_path (str|None) – Path to LoRA weights. If provided, LoRA is loaded after initialization.

Returns:

Initialized VoxCPM instance.

Return type:

VoxCPM

Raises:

ValueError – If hf_model_id is empty.

model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False,
    device="auto",
)

VoxCPM.generate(text, prompt_wav_path=None, prompt_text=None, reference_wav_path=None, cfg_value=2.0, inference_timesteps=10, min_len=2, max_len=4096, normalize=False, denoise=False, retry_badcase=True, retry_badcase_max_times=3, retry_badcase_ratio_threshold=6.0)

Synthesize speech from text.

Parameters:
  • text (str) – Input text to synthesize. For Voice Design, prepend control instructions in parentheses: "(warm female voice)Hello".

  • prompt_wav_path (str|None) – Prompt audio path for continuation-style cloning. Must be paired with prompt_text. For Hi-Fi cloning, combine it with reference_wav_path.

  • prompt_text (str|None) – Exact transcript of the prompt audio. Must be provided together with prompt_wav_path.

  • reference_wav_path (str|None) – Reference audio path for isolated voice cloning (VoxCPM 2 only). Can be used alone or combined with prompt_wav_path + prompt_text.

  • cfg_value (float) – Guidance scale. Higher values follow the conditioning more strictly; lower values allow more variation. Recommended: 1.0–3.0.

  • inference_timesteps (int) – Number of diffusion steps. More steps improve detail at the cost of speed. Recommended: 4–30.

  • min_len (int) – Minimum audio length in model tokens.

  • max_len (int) – Maximum token length during generation. Increase for very long outputs.

  • normalize (bool) – Run text normalization (expand numbers, dates, etc.) before generation.

  • denoise (bool) – Denoise prompt/reference audio before generation. Requires the denoiser to be loaded.

  • retry_badcase (bool) – Automatically retry when the generated audio length is abnormally short or long.

  • retry_badcase_max_times (int) – Maximum number of bad-case retries.

  • retry_badcase_ratio_threshold (float) – Audio-to-text duration ratio threshold for bad-case detection.

Returns:

1-D waveform array (float32). Sample rate is available at model.tts_model.sample_rate.

Return type:

numpy.ndarray

Raises:
  • ValueError – If text is empty.

  • ValueError – If only one of prompt_wav_path and prompt_text is provided. Pass both or neither.

  • ValueError – If reference_wav_path is used with a VoxCPM 1.x model.

  • FileNotFoundError – If audio file paths do not exist.
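
The argument rules above can be checked before calling generate(). The following is a minimal sketch under stated assumptions: describe_generate_mode is a hypothetical helper, the returned mode labels are invented for illustration, and it covers only the empty-text and prompt-pairing rules. It does not check file existence or the model version.

```python
from typing import Optional


def describe_generate_mode(
    text: str,
    prompt_wav_path: Optional[str] = None,
    prompt_text: Optional[str] = None,
    reference_wav_path: Optional[str] = None,
) -> str:
    """Pre-check generate() arguments and name the resulting mode (hypothetical helper)."""
    if not text:
        raise ValueError("text must not be empty")
    # prompt_wav_path and prompt_text are only valid as a pair.
    if (prompt_wav_path is None) != (prompt_text is None):
        raise ValueError("prompt_wav_path and prompt_text must be given together")
    has_prompt = prompt_wav_path is not None
    has_reference = reference_wav_path is not None
    if has_prompt and has_reference:
        return "hifi_cloning"          # prompt audio + transcript + reference audio
    if has_prompt:
        return "continuation_cloning"  # prompt audio + transcript only
    if has_reference:
        return "reference_cloning"     # reference audio only (VoxCPM 2)
    return "voice_design"              # text only, optionally with a "(...)" prefix
```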

# Voice Design
wav = model.generate(
    text="(warm female voice)Hello from VoxCPM!",
    cfg_value=2.0,
)

# Reference-only cloning (VoxCPM 2)
wav = model.generate(
    text="Hello from VoxCPM!",
    reference_wav_path="speaker.wav",
)

# Hi-Fi cloning
wav = model.generate(
    text="Hello from VoxCPM!",
    prompt_wav_path="speaker.wav",
    prompt_text="Exact transcript of speaker.wav.",
    reference_wav_path="speaker.wav",
)
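
Because generate() returns a raw waveform rather than a file, a typical next step is writing it to disk. Below is a minimal sketch using only the standard library. save_wav is a hypothetical helper, and it assumes the samples lie in [-1.0, 1.0], which the reference does not state explicitly; values outside that range are clipped.

```python
import struct
import wave
from typing import Iterable


def save_wav(path: str, samples: Iterable[float], sample_rate: int) -> int:
    """Write float samples as 16-bit mono PCM and return the number of frames written."""
    # Clip to [-1.0, 1.0] (assumed range), then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # generate() returns a 1-D (mono) waveform
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return len(pcm)
```

A call would look like save_wav("out.wav", wav, model.tts_model.sample_rate). For long outputs, a vectorized NumPy conversion of the array is likely faster than this per-sample loop.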

VoxCPM.generate_streaming(text, prompt_wav_path=None, prompt_text=None, reference_wav_path=None, cfg_value=2.0, inference_timesteps=10, min_len=2, max_len=4096, normalize=False, denoise=False, retry_badcase=True, retry_badcase_max_times=3, retry_badcase_ratio_threshold=6.0)

Same interface as generate(), but returns a generator that yields audio chunks incrementally.

All parameters are identical to generate().

Returns:

Generator yielding 1-D waveform chunks (float32).

Return type:

Generator[numpy.ndarray, None, None]

import numpy as np

chunks = []
for chunk in model.generate_streaming(text="Streaming output."):
    chunks.append(chunk)
wav = np.concatenate(chunks)
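
Concatenating every chunk, as above, gives up the main benefit of streaming, which is acting on audio before generation finishes. The sketch below drains a chunk generator while recording how long the first chunk took and how much audio arrived. consume_stream is a hypothetical helper; it relies only on each chunk supporting len().

```python
import time
from typing import Iterable, Sequence, Tuple


def consume_stream(chunks: Iterable[Sequence[float]], sample_rate: int) -> Tuple[float, float]:
    """Drain a chunk generator; return (seconds until first chunk, seconds of audio received)."""
    start = time.perf_counter()
    first_chunk_latency = None
    total_samples = 0
    for chunk in chunks:
        if first_chunk_latency is None:
            first_chunk_latency = time.perf_counter() - start
        total_samples += len(chunk)
        # A real application would hand `chunk` to an audio player or a socket here.
    if first_chunk_latency is None:
        raise ValueError("the stream produced no chunks")
    return first_chunk_latency, total_samples / sample_rate
```

A call would look like consume_stream(model.generate_streaming(text="Streaming output."), model.tts_model.sample_rate).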

VoxCPM.load_lora(lora_weights_path)

Load LoRA weights from a checkpoint file or directory.

Parameters:

lora_weights_path (str) – Path to LoRA weights (.pth file or directory containing lora_weights.ckpt).

Returns:

(loaded_keys, skipped_keys): lists of the loaded and skipped parameter names.

Return type:

tuple[list[str], list[str]]

Raises:

RuntimeError – If the model was not initialized with a LoRA config.
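
The returned key lists make it possible to verify that a checkpoint actually matched the model. The following is a minimal sketch: load_lora_checked is a hypothetical wrapper, and treating an empty loaded_keys list as an error is a policy chosen for this example, not library behavior.

```python
from typing import Any, List, Tuple


def load_lora_checked(model: Any, lora_weights_path: str) -> Tuple[List[str], List[str]]:
    """Call model.load_lora() and fail loudly if no parameter was loaded (hypothetical helper)."""
    loaded_keys, skipped_keys = model.load_lora(lora_weights_path)
    if not loaded_keys:
        raise RuntimeError(f"no LoRA parameters were loaded from {lora_weights_path!r}")
    if skipped_keys:
        # Surface a few skipped names so a checkpoint/config mismatch is easy to spot.
        print(f"skipped {len(skipped_keys)} keys, e.g. {skipped_keys[:3]}")
    return loaded_keys, skipped_keys
```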

VoxCPM.unload_lora()

Reset all LoRA weights to their initial state (effectively zeroing them out). The LoRA layers remain in the model but have no effect.

VoxCPM.set_lora_enabled(enabled)

Enable or disable LoRA layers without unloading weights.

Parameters:

enabled (bool) – True to activate LoRA; False to use the base model only.

VoxCPM.get_lora_state_dict()

Get the current LoRA parameters.

Returns:

State dict containing all lora_A and lora_B parameters.

Return type:

dict

VoxCPM.lora_enabled: bool

True if a LoRA config is currently loaded on this model.
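
One use of set_lora_enabled() is comparing fine-tuned and base output within a single process. A minimal sketch follows: lora_disabled is a hypothetical context manager, and it assumes LoRA was active before entering, so it always re-enables LoRA on exit.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def lora_disabled(model: Any) -> Iterator[Any]:
    """Temporarily run with the base model only, then switch LoRA back on."""
    model.set_lora_enabled(False)
    try:
        yield model
    finally:
        # Assumption: LoRA was enabled before entering, so restore it unconditionally.
        model.set_lora_enabled(True)
```

Inside a `with lora_disabled(model):` block, generate() calls use the base weights; after the block, the LoRA layers apply again.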


CLI

The voxcpm command provides three subcommands: design, clone, and batch. The default model is openbmb/VoxCPM2.

Subcommands

voxcpm design

Generate speech from text without any reference audio. Optionally describe the target voice with --control.

voxcpm design --text "Hello world" --output out.wav
voxcpm design --text "Hello world" --control "warm female voice" --output out.wav

voxcpm clone

Clone a voice from reference audio, from prompt audio with its transcript, or from both.

# Reference-only cloning (VoxCPM 2)
voxcpm clone --text "Hello" --reference-audio ref.wav --output out.wav

# Hi-Fi cloning
voxcpm clone --text "Hello" \
    --prompt-audio ref.wav --prompt-text "Transcript of ref.wav" \
    --reference-audio ref.wav --output out.wav

# With style control
voxcpm clone --text "Hello" --reference-audio ref.wav \
    --control "speaking slowly" --output out.wav

voxcpm batch

Process a text file where each line becomes a separate output WAV (output_001.wav, output_002.wav, …).

voxcpm batch --input texts.txt --output-dir ./outs
voxcpm batch --input texts.txt --output-dir ./outs --reference-audio ref.wav
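
The line-to-file mapping can be previewed with a short sketch. plan_batch_outputs is a hypothetical helper, not part of the CLI. The three-digit numbering follows the file names shown above, while skipping blank lines is an assumption of this example that the reference does not confirm.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def plan_batch_outputs(lines: Iterable[str], output_dir: str) -> List[Tuple[str, Path]]:
    """Pair each non-empty input line with a numbered output path."""
    # Assumption: blank lines are skipped and do not consume a number.
    texts = [line.strip() for line in lines if line.strip()]
    return [
        (text, Path(output_dir) / f"output_{index:03d}.wav")
        for index, text in enumerate(texts, start=1)
    ]
```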

Arguments

Generation

--text, -t <TEXT>

Text to synthesize.

--control <INSTRUCTION>

Control instruction for voice design or style control (e.g. "warm female voice"). Cannot be used together with --prompt-text.

--cfg-value <FLOAT>

CFG guidance scale. Default: 2.0. Recommended: 1.0–3.0.

--inference-timesteps <INT>

Number of diffusion steps. Default: 10. Recommended: 4–30.

--normalize

Enable text normalization (expand numbers, dates, etc.).

Prompt & Reference Audio

--prompt-audio, -pa <PATH>

Prompt audio file for continuation mode. Requires --prompt-text or --prompt-file.

--prompt-text, -pt <TEXT>

Text transcript of the prompt audio.

--prompt-file <PATH>

Text file containing the prompt transcript (alternative to --prompt-text).

--reference-audio, -ra <PATH>

Reference audio for isolated voice cloning (VoxCPM 2 only).

--denoise

Denoise prompt/reference audio with ZipEnhancer before generation.
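
When driving the CLI from a script, the flag rules above can be checked before spawning a process. The following is a minimal sketch: build_clone_command is a hypothetical helper that uses only flags documented here, it enforces just the two stated constraints, and it does not cover --prompt-file.

```python
from typing import List, Optional


def build_clone_command(
    text: str,
    output: str,
    reference_audio: Optional[str] = None,
    prompt_audio: Optional[str] = None,
    prompt_text: Optional[str] = None,
    control: Optional[str] = None,
    denoise: bool = False,
) -> List[str]:
    """Build an argv list for `voxcpm clone` after checking the documented flag rules."""
    if prompt_audio is not None and prompt_text is None:
        raise ValueError("--prompt-audio requires --prompt-text (or --prompt-file)")
    if control is not None and prompt_text is not None:
        raise ValueError("--control cannot be combined with --prompt-text")
    argv = ["voxcpm", "clone", "--text", text, "--output", output]
    if prompt_audio is not None:
        argv += ["--prompt-audio", prompt_audio, "--prompt-text", prompt_text]
    if reference_audio is not None:
        argv += ["--reference-audio", reference_audio]
    if control is not None:
        argv += ["--control", control]
    if denoise:
        argv.append("--denoise")
    return argv
```

The resulting list can be passed to subprocess.run(argv, check=True), which avoids shell quoting issues with the text and control strings.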

Model Loading

--model-path <PATH>

Local model directory. If set, --hf-model-id is ignored.

--hf-model-id <ID>

Hugging Face repo id. Default: openbmb/VoxCPM2.

--device <DEVICE>

Runtime device selection. Supported values are auto, cpu, mps, cuda, and indexed CUDA devices such as cuda:0. auto prefers cuda -> mps -> cpu. An explicit value that is unavailable raises an error instead of silently falling back.

--cache-dir <PATH>

Cache directory for Hub downloads.

--local-files-only

Use local files only; do not download from the Hub.

--no-denoiser

Skip loading the denoiser model.

--no-optimize

Disable torch.compile acceleration. This is commonly useful for debugging, unsupported platforms, and non-CUDA environments.

--zipenhancer-path <PATH>

Custom ZipEnhancer model id or local path.

LoRA

--lora-path <PATH>

Path to LoRA weights directory.

--lora-r <INT>

LoRA rank. Default: 32.

--lora-alpha <INT>

LoRA alpha (scaling = alpha / r). Default: 16.

--lora-dropout <FLOAT>

LoRA dropout rate (0.0–1.0). Default: 0.0.

--lora-disable-lm

Disable LoRA on LM layers.

--lora-disable-dit

Disable LoRA on DiT layers.

--lora-enable-proj

Enable LoRA on projection layers.
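
As a worked example of the scaling rule given for --lora-alpha (scaling = alpha / r), the CLI defaults yield a factor of 0.5. lora_scaling is a hypothetical helper written only to show the arithmetic.

```python
def lora_scaling(alpha: float, r: int) -> float:
    """Scaling factor for the LoRA update, following the documented alpha / r rule."""
    if r <= 0:
        raise ValueError("LoRA rank must be positive")
    return alpha / r


print(lora_scaling(16, 32))  # CLI defaults: 16 / 32 = 0.5
```

Raising --lora-r without also raising --lora-alpha therefore lowers the scaling factor.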

Note

The legacy flat CLI (voxcpm --text "..." --output out.wav) still works but is deprecated. Prefer the subcommand style.