API Reference

Python API

class VoxCPM(voxcpm_model_path, zipenhancer_model_path='iic/speech_zipenhancer_ans_multiloss_16k_base', enable_denoiser=True, optimize=True, device=None, lora_config=None, lora_weights_path=None)

Initialize VoxCPM from a local model directory.

The model architecture (voxcpm or voxcpm2) is auto-detected from the architecture field in config.json.

Parameters:

voxcpm_model_path (str) – Local path to the model directory containing weights, configs, and tokenizer files.
zipenhancer_model_path (str|None) – ModelScope denoiser model id or local path. Set to None to skip the denoiser entirely.
enable_denoiser (bool) – Whether to initialize the ZipEnhancer denoiser pipeline.
optimize (bool) – Enable torch.compile acceleration. Disable for debugging or unsupported platforms. This optimization is primarily useful on CUDA.
device (str|None) – Runtime device. Use None or "auto" for automatic fallback (cuda -> mps -> cpu), or an explicit value such as "cpu", "mps", "cuda", or "cuda:0". Explicit requests raise an error instead of silently falling back.
lora_config (LoRAConfig|None) – LoRA configuration. If lora_weights_path is provided without this, a default config (enable_lm=True, enable_dit=True) is created automatically.
lora_weights_path (str|None) – Path to pre-trained LoRA weights (.pth file or directory containing lora_weights.ckpt).

model = VoxCPM(
    voxcpm_model_path="/path/to/VoxCPM2",
    enable_denoiser=False,
    device="auto",
)
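The device rules described above can be illustrated with a small standalone sketch. It is not VoxCPM's own code: resolve_device and its available argument (a set of usable backends) are hypothetical, and the exception types are assumptions, since the reference only states that explicit requests raise an error.

```python
def resolve_device(requested, available):
    """Pick a runtime device following the documented rules.

    `available` is a hypothetical set of usable backends, e.g. {"cuda", "cpu"}.
    Exception types are assumptions; the reference only says "an error".
    """
    if requested is None or requested == "auto":
        # Automatic mode: walk the documented preference order.
        for candidate in ("cuda", "mps", "cpu"):
            if candidate in available:
                return candidate
        raise RuntimeError("no usable device found")
    # Explicit mode: "cuda:0" needs the "cuda" backend and never falls back.
    backend = requested.split(":", 1)[0]
    if backend not in ("cpu", "mps", "cuda"):
        raise ValueError(f"unsupported device: {requested!r}")
    if backend not in available:
        raise RuntimeError(f"requested device {requested!r} is not available")
    return requested


print(resolve_device("auto", {"mps", "cpu"}))  # prints "mps"
```

The point of the explicit branch is that a request such as "cuda" on a CPU-only machine fails loudly instead of quietly running on a slower device.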

classmethod VoxCPM.from_pretrained(hf_model_id='openbmb/VoxCPM2', load_denoiser=True, zipenhancer_model_id='iic/speech_zipenhancer_ans_multiloss_16k_base', cache_dir=None, local_files_only=False, optimize=True, device=None, lora_config=None, lora_weights_path=None, **kwargs)

Instantiate VoxCPM from a Hugging Face Hub snapshot. Downloads model weights automatically on first use.

Parameters:

hf_model_id (str) – Hugging Face repo id (e.g. "openbmb/VoxCPM2") or local directory path.
load_denoiser (bool) – Whether to initialize the denoiser pipeline.
zipenhancer_model_id (str) – Denoiser model id or local path. Ignored when load_denoiser=False.
cache_dir (str|None) – Custom cache directory for the snapshot download.
local_files_only (bool) – If True, only use local files and do not attempt to download.
optimize (bool) – Enable torch.compile acceleration. This is primarily a CUDA optimization.
device (str|None) – Runtime device. None / "auto" uses automatic fallback. Explicit choices such as "cpu", "mps", "cuda", and "cuda:0" are validated and do not auto-fallback.
lora_config (LoRAConfig|None) – LoRA configuration for fine-tuned models.
lora_weights_path (str|None) – Path to LoRA weights. If provided, LoRA is loaded after initialization.

Returns:

Initialized VoxCPM instance.

Return type:

VoxCPM

Raises:

ValueError – If hf_model_id is empty.

model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False,
    device="auto",
)

VoxCPM.generate(text, prompt_wav_path=None, prompt_text=None, reference_wav_path=None, cfg_value=2.0, inference_timesteps=10, min_len=2, max_len=4096, normalize=False, denoise=False, retry_badcase=True, retry_badcase_max_times=3, retry_badcase_ratio_threshold=6.0)

Synthesize speech from text.

Parameters:

text (str) – Input text to synthesize. For Voice Design, prepend control instructions in parentheses: "(warm female voice)Hello".
prompt_wav_path (str|None) – Prompt audio path for continuation-style cloning. Must be paired with prompt_text. For Hi-Fi cloning, combine it with reference_wav_path.
prompt_text (str|None) – Exact transcript of the prompt audio. Must be provided together with prompt_wav_path.
reference_wav_path (str|None) – Reference audio path for isolated voice cloning (VoxCPM 2 only). Can be used alone or combined with prompt_wav_path + prompt_text.
cfg_value (float) – Guidance scale. Higher values follow the conditioning more strictly; lower values allow more variation. Recommended: 1.0–3.0.
inference_timesteps (int) – Number of diffusion steps. More steps improve detail at the cost of speed. Recommended: 4–30.
min_len (int) – Minimum audio length in model tokens.
max_len (int) – Maximum token length during generation. Increase for very long outputs.
normalize (bool) – Run text normalization (expand numbers, dates, etc.) before generation.
denoise (bool) – Denoise prompt/reference audio before generation. Requires the denoiser to be loaded.
retry_badcase (bool) – Automatically retry when the generated audio length is abnormally short or long.
retry_badcase_max_times (int) – Maximum number of bad-case retries.
retry_badcase_ratio_threshold (float) – Audio-to-text duration ratio threshold for bad-case detection.

Returns:

1-D waveform array (float32). Sample rate is available at model.tts_model.sample_rate.

Return type:

numpy.ndarray

Raises:

ValueError – If text is empty.
ValueError – If only one of prompt_wav_path and prompt_text is provided. They must be given together or both left as None.
ValueError – If reference_wav_path is used with a VoxCPM 1.x model.
FileNotFoundError – If audio file paths do not exist.
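To catch these errors before a model is loaded, the documented conditions can be checked up front. The following is a minimal sketch, not the library's validation code: check_generate_args and its is_voxcpm2 flag are hypothetical, and the order of the checks is an assumption.

```python
import os


def check_generate_args(text, prompt_wav_path=None, prompt_text=None,
                        reference_wav_path=None, is_voxcpm2=True):
    """Hypothetical pre-flight check mirroring the documented Raises list."""
    if not text:
        raise ValueError("text must not be empty")
    # prompt_wav_path and prompt_text must be given together or not at all.
    if (prompt_wav_path is None) != (prompt_text is None):
        raise ValueError("prompt_wav_path and prompt_text must be provided together")
    # reference_wav_path is documented as VoxCPM 2 only.
    if reference_wav_path is not None and not is_voxcpm2:
        raise ValueError("reference_wav_path requires a VoxCPM 2 model")
    # Every audio path that was supplied must point at an existing file.
    for path in (prompt_wav_path, reference_wav_path):
        if path is not None and not os.path.isfile(path):
            raise FileNotFoundError(path)
```

Calling it with the same keyword arguments that will later go to generate() lets a batch job reject bad rows early.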

# Voice Design
wav = model.generate(
    text="(warm female voice)Hello from VoxCPM!",
    cfg_value=2.0,
)

# Reference-only cloning (VoxCPM 2)
wav = model.generate(
    text="Hello from VoxCPM!",
    reference_wav_path="speaker.wav",
)

# Hi-Fi cloning
wav = model.generate(
    text="Hello from VoxCPM!",
    prompt_wav_path="speaker.wav",
    prompt_text="Exact transcript of speaker.wav.",
    reference_wav_path="speaker.wav",
)

VoxCPM.generate_streaming(text, prompt_wav_path=None, prompt_text=None, reference_wav_path=None, cfg_value=2.0, inference_timesteps=10, min_len=2, max_len=4096, normalize=False, denoise=False, retry_badcase=True, retry_badcase_max_times=3, retry_badcase_ratio_threshold=6.0)

Same interface as generate(), but returns a generator that yields audio chunks incrementally.

All parameters are identical to generate().

Returns:

Generator yielding 1-D waveform chunks (float32).

Return type:

Generator[numpy.ndarray, None, None]

import numpy as np

chunks = []
for chunk in model.generate_streaming(text="Streaming output."):
    chunks.append(chunk)
wav = np.concatenate(chunks)
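Streamed chunks can also be written to disk as they arrive instead of being concatenated in memory. The sketch below uses only the standard library. write_chunks_to_wav is a hypothetical helper, the 16-bit mono format is a choice made for this example, and it assumes samples lie in [-1.0, 1.0], which this reference does not state.

```python
import array
import sys
import wave


def write_chunks_to_wav(path, chunks, sample_rate):
    """Write float chunks to a 16-bit mono WAV file as they arrive.

    Assumes samples lie in [-1.0, 1.0]; values outside are clipped.
    Returns the number of samples written.
    """
    total = 0
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        for chunk in chunks:
            # Clip each sample, then scale it to the int16 range.
            pcm = array.array(
                "h",
                (int(max(-1.0, min(1.0, float(s))) * 32767) for s in chunk),
            )
            if sys.byteorder == "big":
                pcm.byteswap()  # WAV data is little-endian
            wav_file.writeframes(pcm.tobytes())
            total += len(pcm)
    return total
```

With a loaded model, a call would look like write_chunks_to_wav("out.wav", model.generate_streaming(text="Streaming output."), model.tts_model.sample_rate). Passing [wav] as chunks saves the single array returned by generate() the same way.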

VoxCPM.load_lora(lora_weights_path)

Load LoRA weights from a checkpoint file or directory.

Parameters:

lora_weights_path (str) – Path to LoRA weights (.pth file or directory containing lora_weights.ckpt).

Returns:

(loaded_keys, skipped_keys): lists of loaded and skipped parameter names.

Return type:

tuple[list[str], list[str]]

Raises:

RuntimeError – If the model was not initialized with a LoRA config.

VoxCPM.unload_lora()

Reset all LoRA weights to their initial state (effectively zeroing them out). The LoRA layers remain in the model but have no effect.

VoxCPM.set_lora_enabled(enabled)

Enable or disable LoRA layers without unloading weights.

Parameters:

enabled (bool) – True to activate LoRA; False to use the base model only.
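Because the weights stay loaded, toggling makes it easy to compare a fine-tuned voice against the base model. The helper below is a hypothetical sketch that relies only on set_lora_enabled() and generate() as documented here. It assumes LoRA weights are already loaded and that LoRA should be left enabled afterwards.

```python
def generate_with_and_without_lora(model, text):
    """Return (tuned, base) waveforms for the same text.

    `model` is expected to be a VoxCPM instance with LoRA weights loaded.
    """
    model.set_lora_enabled(True)
    try:
        tuned = model.generate(text=text)
        model.set_lora_enabled(False)  # base model only
        base = model.generate(text=text)
    finally:
        # Restore LoRA even if generation fails partway through.
        model.set_lora_enabled(True)
    return tuned, base
```

If generation is not deterministic, some of the difference between the two outputs will come from sampling rather than from the LoRA weights.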

VoxCPM.get_lora_state_dict()

Get the current LoRA parameters.

Returns:

State dict containing all lora_A and lora_B parameters.

Return type:

dict

VoxCPM.lora_enabled: bool

True if a LoRA config is currently loaded on this model.

CLI

The voxcpm command provides three subcommands: design, clone, and batch. Default model: openbmb/VoxCPM2.

Subcommands

voxcpm design

Generate speech from text without any reference audio. Optionally describe the target voice with --control.

voxcpm design --text "Hello world" --output out.wav
voxcpm design --text "Hello world" --control "warm female voice" --output out.wav

voxcpm clone

Clone a voice from reference audio, from prompt audio with its transcript, or from both.

# Reference-only cloning (VoxCPM 2)
voxcpm clone --text "Hello" --reference-audio ref.wav --output out.wav

# Hi-Fi cloning
voxcpm clone --text "Hello" \
    --prompt-audio ref.wav --prompt-text "Transcript of ref.wav" \
    --reference-audio ref.wav --output out.wav

# With style control
voxcpm clone --text "Hello" --reference-audio ref.wav \
    --control "speaking slowly" --output out.wav

voxcpm batch

Process a text file where each line becomes a separate output WAV (output_001.wav, output_002.wav, …).

voxcpm batch --input texts.txt --output-dir ./outs
voxcpm batch --input texts.txt --output-dir ./outs --reference-audio ref.wav
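To preview which files a batch run should produce, the line-to-file mapping can be sketched in Python. plan_batch_outputs is a hypothetical helper, not part of the CLI. The 1-based, three-digit numbering follows the names shown above, while skipping blank lines is an assumption about how the command treats them.

```python
from pathlib import Path


def plan_batch_outputs(input_path, output_dir):
    """Pair each non-empty input line with an output_NNN.wav path.

    Assumption: blank lines are skipped and do not consume a number.
    """
    lines = Path(input_path).read_text(encoding="utf-8").splitlines()
    texts = [line.strip() for line in lines if line.strip()]
    return [
        (text, Path(output_dir) / f"output_{index:03d}.wav")
        for index, text in enumerate(texts, start=1)
    ]
```

Comparing this plan with the contents of the output directory after a run is a quick way to spot lines that failed to synthesize.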

Arguments

Generation

--text, -t <TEXT>: Text to synthesize.

--control <INSTRUCTION>: Control instruction for voice design or style control (e.g. "warm female voice"). Cannot be used together with --prompt-text.

--cfg-value <FLOAT>: CFG guidance scale. Default: 2.0. Recommended: 1.0–3.0.

--inference-timesteps <INT>: Number of diffusion steps. Default: 10. Recommended: 4–30.

--normalize: Enable text normalization (expand numbers, dates, etc.).

Prompt & Reference Audio

--prompt-audio, -pa <PATH>: Prompt audio file for continuation mode. Requires --prompt-text or --prompt-file.

--prompt-text, -pt <TEXT>: Text transcript of the prompt audio.

--prompt-file <PATH>: Text file containing the prompt transcript (alternative to --prompt-text).

--reference-audio, -ra <PATH>: Reference audio for isolated voice cloning (VoxCPM 2 only).

--denoise: Denoise prompt/reference audio with ZipEnhancer before generation.
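When the CLI is driven from a script, the flag constraints listed above can be enforced before the process is launched. This is a sketch under stated assumptions: build_clone_command is a hypothetical helper, it supports --prompt-text but not --prompt-file, and it treats prompt audio and transcript as a strict pair, following the Python API rule.

```python
def build_clone_command(text, output, reference_audio=None,
                        prompt_audio=None, prompt_text=None, control=None):
    """Return an argv list for `voxcpm clone` (hypothetical helper)."""
    if control is not None and prompt_text is not None:
        raise ValueError("--control cannot be used together with --prompt-text")
    # This sketch only covers --prompt-text, not --prompt-file.
    if (prompt_audio is None) != (prompt_text is None):
        raise ValueError("--prompt-audio and --prompt-text must be given together")
    command = ["voxcpm", "clone", "--text", text]
    if prompt_audio is not None:
        command += ["--prompt-audio", prompt_audio, "--prompt-text", prompt_text]
    if reference_audio is not None:
        command += ["--reference-audio", reference_audio]
    if control is not None:
        command += ["--control", control]
    command += ["--output", output]
    return command
```

The returned list can be passed to subprocess.run(command, check=True), which avoids shell quoting problems with spaces in the text or in file paths.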

Model Loading

--model-path <PATH>: Local model directory. If set, --hf-model-id is ignored.

--hf-model-id <ID>: Hugging Face repo id. Default: openbmb/VoxCPM2.

--device <DEVICE>: Runtime device selection. Supported values are auto, cpu, mps, cuda, and indexed CUDA devices such as cuda:0. auto prefers cuda -> mps -> cpu. Explicit values raise an error instead of silently falling back.

--cache-dir <PATH>: Cache directory for Hub downloads.

--local-files-only: Only use local files; do not download from the Hub.

--no-denoiser: Skip loading the denoiser model.

--no-optimize: Disable torch.compile acceleration. This is commonly useful for debugging, unsupported platforms, and non-CUDA environments.

--zipenhancer-path <PATH>: Custom ZipEnhancer model id or local path.

LoRA

--lora-path <PATH>: Path to LoRA weights directory.

--lora-r <INT>: LoRA rank. Default: 32.

--lora-alpha <INT>: LoRA alpha (scaling = alpha / r). Default: 16.

--lora-dropout <FLOAT>: LoRA dropout rate (0.0–1.0). Default: 0.0.

--lora-disable-lm: Disable LoRA on LM layers.

--lora-disable-dit: Disable LoRA on DiT layers.

--lora-enable-proj: Enable LoRA on projection layers.
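As a worked example of the scaling formula given for --lora-alpha, the snippet below evaluates it for the default values. lora_scaling is a hypothetical helper name; only the formula scaling = alpha / r comes from this reference.

```python
def lora_scaling(alpha, r):
    """Scaling factor applied to the LoRA update: alpha / r."""
    return alpha / r


# Defaults: --lora-alpha 16, --lora-r 32
print(lora_scaling(16, 32))  # prints 0.5
```

Doubling the rank to 64 while keeping the same scaling of 0.5 would therefore require --lora-alpha 32.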

Note

The legacy flat CLI (voxcpm --text "..." --output out.wav) still works but is deprecated. Prefer the subcommand style.