API Reference
Python API
- class VoxCPM(voxcpm_model_path, zipenhancer_model_path='iic/speech_zipenhancer_ans_multiloss_16k_base', enable_denoiser=True, optimize=True, device=None, lora_config=None, lora_weights_path=None)
Initialize VoxCPM from a local model directory.
The model architecture (voxcpm or voxcpm2) is auto-detected from the architecture field in config.json.
- Parameters:
voxcpm_model_path (str) – Local path to the model directory containing weights, configs, and tokenizer files.
zipenhancer_model_path (str|None) – ModelScope denoiser model id or local path. Set to None to skip the denoiser entirely.
enable_denoiser (bool) – Whether to initialize the ZipEnhancer denoiser pipeline.
optimize (bool) – Enable torch.compile acceleration. Disable for debugging or unsupported platforms. This optimization is primarily useful on CUDA.
device (str|None) – Runtime device. Use None or "auto" for automatic fallback (cuda -> mps -> cpu), or an explicit value such as "cpu", "mps", "cuda", or "cuda:0". Explicit requests raise an error instead of silently falling back.
lora_config (LoRAConfig|None) – LoRA configuration. If lora_weights_path is provided without this, a default config (enable_lm=True, enable_dit=True) is created automatically.
lora_weights_path (str|None) – Path to pre-trained LoRA weights (.pth file or directory containing lora_weights.ckpt).
    model = VoxCPM(
        voxcpm_model_path="/path/to/VoxCPM2",
        enable_denoiser=False,
        device="auto",
    )
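The device rules described for the constructor (automatic fallback for None/"auto", strict handling of explicit requests) can be sketched as a small pure function. This is an illustration of the documented behavior, not VoxCPM's implementation: `resolve_device` and its two availability flags are hypothetical, and the exception types are assumptions, since the reference only says that explicit requests "raise an error". In real code the flags would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
def resolve_device(requested, cuda_available, mps_available):
    """Illustrative sketch of the documented device rules (hypothetical helper)."""
    if requested is None or requested == "auto":
        # Automatic fallback order from the reference: cuda -> mps -> cpu.
        if cuda_available:
            return "cuda"
        if mps_available:
            return "mps"
        return "cpu"

    # Explicit requests are validated and never silently downgraded.
    if requested == "cpu":
        return "cpu"
    if requested == "mps":
        if not mps_available:
            # Assumption: the real error type may differ.
            raise RuntimeError("device 'mps' was requested but is not available")
        return "mps"
    if requested == "cuda" or requested.startswith("cuda:"):
        if not cuda_available:
            raise RuntimeError(f"device {requested!r} was requested but CUDA is not available")
        return requested
    raise ValueError(f"unsupported device: {requested!r}")
```

With this sketch, `resolve_device("auto", False, True)` selects "mps", while `resolve_device("cuda:0", False, True)` raises instead of quietly using another device.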
- classmethod VoxCPM.from_pretrained(hf_model_id='openbmb/VoxCPM2', load_denoiser=True, zipenhancer_model_id='iic/speech_zipenhancer_ans_multiloss_16k_base', cache_dir=None, local_files_only=False, optimize=True, device=None, lora_config=None, lora_weights_path=None, **kwargs)
Instantiate VoxCPM from a Hugging Face Hub snapshot. Downloads model weights automatically on first use.
- Parameters:
hf_model_id (str) – Hugging Face repo id (e.g. "openbmb/VoxCPM2") or local directory path.
load_denoiser (bool) – Whether to initialize the denoiser pipeline.
zipenhancer_model_id (str) – Denoiser model id or local path. Ignored when load_denoiser=False.
cache_dir (str|None) – Custom cache directory for the snapshot download.
local_files_only (bool) – If True, only use local files and do not attempt to download.
optimize (bool) – Enable torch.compile acceleration. This is primarily a CUDA optimization.
device (str|None) – Runtime device. None/"auto" uses automatic fallback. Explicit choices such as "cpu", "mps", "cuda", and "cuda:0" are validated and do not fall back automatically.
lora_config (LoRAConfig|None) – LoRA configuration for fine-tuned models.
lora_weights_path (str|None) – Path to LoRA weights. If provided, LoRA is loaded after initialization.
- Returns:
Initialized VoxCPM instance.
- Return type:
VoxCPM
- Raises:
ValueError – If hf_model_id is empty.
    model = VoxCPM.from_pretrained(
        "openbmb/VoxCPM2",
        load_denoiser=False,
        device="auto",
    )
- VoxCPM.generate(text, prompt_wav_path=None, prompt_text=None, reference_wav_path=None, cfg_value=2.0, inference_timesteps=10, min_len=2, max_len=4096, normalize=False, denoise=False, retry_badcase=True, retry_badcase_max_times=3, retry_badcase_ratio_threshold=6.0)
Synthesize speech from text.
- Parameters:
text (str) – Input text to synthesize. For Voice Design, prepend control instructions in parentheses: "(warm female voice)Hello".
prompt_wav_path (str|None) – Prompt audio path for continuation-style cloning. Must be paired with prompt_text. For Hi-Fi cloning, combine it with reference_wav_path.
prompt_text (str|None) – Exact transcript of the prompt audio. Must be provided together with prompt_wav_path.
reference_wav_path (str|None) – Reference audio path for isolated voice cloning (VoxCPM 2 only). Can be used alone or combined with prompt_wav_path + prompt_text.
cfg_value (float) – Guidance scale. Higher values follow the conditioning more strictly; lower values allow more variation. Recommended: 1.0–3.0.
inference_timesteps (int) – Number of diffusion steps. More steps improve detail at the cost of speed. Recommended: 4–30.
min_len (int) – Minimum audio length in model tokens.
max_len (int) – Maximum token length during generation. Increase for very long outputs.
normalize (bool) – Run text normalization (expand numbers, dates, etc.) before generation.
denoise (bool) – Denoise prompt/reference audio before generation. Requires the denoiser to be loaded.
retry_badcase (bool) – Automatically retry when the generated audio length is abnormally short or long.
retry_badcase_max_times (int) – Maximum number of bad-case retries.
retry_badcase_ratio_threshold (float) – Audio-to-text duration ratio threshold for bad-case detection.
- Returns:
1-D waveform array (float32). Sample rate is available at model.tts_model.sample_rate.
- Return type:
numpy.ndarray
- Raises:
ValueError – If text is empty.
ValueError – If prompt_wav_path and prompt_text are not both provided or both None.
ValueError – If reference_wav_path is used with a VoxCPM 1.x model.
FileNotFoundError – If audio file paths do not exist.
    # Voice Design
    wav = model.generate(
        text="(warm female voice)Hello from VoxCPM!",
        cfg_value=2.0,
    )

    # Reference-only cloning (VoxCPM 2)
    wav = model.generate(
        text="Hello from VoxCPM!",
        reference_wav_path="speaker.wav",
    )

    # Hi-Fi cloning
    wav = model.generate(
        text="Hello from VoxCPM!",
        prompt_wav_path="speaker.wav",
        prompt_text="Exact transcript of speaker.wav.",
        reference_wav_path="speaker.wav",
    )
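The argument combinations accepted by generate() map onto four modes: plain text or Voice Design, reference-only cloning, continuation-style cloning, and Hi-Fi cloning. The sketch below restates the documented pairing rules as a standalone check that can run before a model is loaded. `classify_generate_call` and its mode labels are hypothetical names chosen for illustration; the function does not check file existence or the model version, which generate() also validates.

```python
def classify_generate_call(text, prompt_wav_path=None, prompt_text=None,
                           reference_wav_path=None):
    """Return a mode label for a generate() call, or raise ValueError.

    Hypothetical helper mirroring the documented rules; not part of VoxCPM.
    """
    if not text:
        raise ValueError("text must not be empty")

    # prompt_wav_path and prompt_text must be given together or not at all.
    if (prompt_wav_path is None) != (prompt_text is None):
        raise ValueError("prompt_wav_path and prompt_text must be provided together")

    has_prompt = prompt_wav_path is not None
    has_reference = reference_wav_path is not None

    if has_prompt and has_reference:
        return "hifi_cloning"            # prompt audio + transcript + reference
    if has_prompt:
        return "continuation_cloning"    # prompt audio + transcript only
    if has_reference:
        return "reference_cloning"       # reference audio only (VoxCPM 2)
    return "voice_design"                # text only, optionally with "(control)"
```

Running the check first gives an early, cheap failure for a mismatched prompt pair instead of waiting for the model call to reject it.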
- VoxCPM.generate_streaming(text, prompt_wav_path=None, prompt_text=None, reference_wav_path=None, cfg_value=2.0, inference_timesteps=10, min_len=2, max_len=4096, normalize=False, denoise=False, retry_badcase=True, retry_badcase_max_times=3, retry_badcase_ratio_threshold=6.0)
Same interface as generate(), but returns a generator that yields audio chunks incrementally.
All parameters are identical to generate().
- Returns:
Generator yielding 1-D waveform chunks (float32).
- Return type:
Generator[numpy.ndarray, None, None]
    import numpy as np

    chunks = []
    for chunk in model.generate_streaming(text="Streaming output."):
        chunks.append(chunk)
    wav = np.concatenate(chunks)
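Instead of collecting every chunk in memory, the chunks can be written to disk as they arrive. The following is a minimal sketch using only the standard library. `write_stream_to_wav` is a hypothetical helper, and it assumes the samples are floats in the range [-1.0, 1.0], which the reference does not state explicitly; out-of-range values are clipped. It accepts any iterable of chunks whose elements are iterables of numbers, which includes the NumPy arrays yielded by generate_streaming().

```python
import sys
import wave
from array import array


def write_stream_to_wav(path, chunks, sample_rate):
    """Write float chunks to a mono 16-bit PCM WAV file as they arrive.

    Returns the total number of samples written.
    """
    total = 0
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 16-bit PCM
        wav_file.setframerate(sample_rate)
        for chunk in chunks:
            # Clip to [-1.0, 1.0] (assumed range) and scale to int16.
            pcm = array(
                "h",
                (int(max(-1.0, min(1.0, float(s))) * 32767) for s in chunk),
            )
            if sys.byteorder == "big":
                pcm.byteswap()  # WAV sample data is little-endian
            wav_file.writeframes(pcm.tobytes())
            total += len(pcm)
    return total
```

A typical call would pass the generator directly, for example `write_stream_to_wav("out.wav", model.generate_streaming(text="Streaming output."), model.tts_model.sample_rate)`.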
- VoxCPM.load_lora(lora_weights_path)
Load LoRA weights from a checkpoint file or directory.
- Parameters:
lora_weights_path (str) – Path to LoRA weights (.pth file or directory containing lora_weights.ckpt).
- Returns:
(loaded_keys, skipped_keys): lists of loaded and skipped parameter names.
- Return type:
tuple[list[str], list[str]]
- Raises:
RuntimeError – If the model was not initialized with a LoRA config.
- VoxCPM.unload_lora()
Reset all LoRA weights to their initial state (effectively zeroing them out). The LoRA layers remain in the model but have no effect.
- VoxCPM.set_lora_enabled(enabled)
Enable or disable LoRA layers without unloading weights.
- Parameters:
enabled (bool) – True to activate LoRA; False to use the base model only.
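Because set_lora_enabled() keeps the weights loaded, it is convenient for A/B comparisons between the fine-tuned and base voices. One possible pattern is a context manager that switches LoRA off for a block and back on afterwards. `base_model_only` is a hypothetical helper, and it assumes LoRA was active before the block, since the reference does not document a way to read the current on/off state.

```python
from contextlib import contextmanager


@contextmanager
def base_model_only(model):
    """Temporarily disable LoRA layers, then re-enable them on exit."""
    model.set_lora_enabled(False)
    try:
        yield model
    finally:
        # Assumption: LoRA was enabled before entering the block.
        model.set_lora_enabled(True)
```

Inside `with base_model_only(model):` a call to `model.generate(...)` would use the base weights; LoRA is restored even if generation raises.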
- VoxCPM.get_lora_state_dict()
Get the current LoRA parameters.
- Returns:
State dict containing all lora_A and lora_B parameters.
- Return type:
dict
- VoxCPM.lora_enabled: bool
True if a LoRA config is currently loaded on this model.
CLI
The voxcpm command provides three subcommands. Default model: openbmb/VoxCPM2.
Subcommands
- voxcpm design
Generate speech from text without any reference audio. Optionally describe the target voice with --control.
    voxcpm design --text "Hello world" --output out.wav
    voxcpm design --text "Hello world" --control "warm female voice" --output out.wav
- voxcpm clone
Clone a voice using reference audio, or prompt audio with its transcript.
    # Reference-only cloning (VoxCPM 2)
    voxcpm clone --text "Hello" --reference-audio ref.wav --output out.wav

    # Hi-Fi cloning
    voxcpm clone --text "Hello" \
        --prompt-audio ref.wav --prompt-text "Transcript of ref.wav" \
        --reference-audio ref.wav --output out.wav

    # With style control
    voxcpm clone --text "Hello" --reference-audio ref.wav \
        --control "speaking slowly" --output out.wav
- voxcpm batch
Process a text file where each line becomes a separate output WAV (output_001.wav, output_002.wav, …).
    voxcpm batch --input texts.txt --output-dir ./outs
    voxcpm batch --input texts.txt --output-dir ./outs --reference-audio ref.wav
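The batch naming scheme (one input line per output file, numbered from 001) can be previewed before running a long job. The sketch below reproduces only the documented file-name pattern. `plan_batch_outputs` is a hypothetical helper, and skipping blank lines is an assumption of this sketch; the reference does not say how the CLI treats them.

```python
from pathlib import Path


def plan_batch_outputs(lines, output_dir):
    """Pair each non-empty input line with its expected output path."""
    out_dir = Path(output_dir)
    # Assumption: blank lines are ignored and do not consume a number.
    texts = [line.strip() for line in lines if line.strip()]
    return [
        (out_dir / f"output_{index:03d}.wav", text)
        for index, text in enumerate(texts, start=1)
    ]
```

For example, `plan_batch_outputs(open("texts.txt", encoding="utf-8"), "./outs")` lists which text is expected to land in which WAV file.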
Arguments
Generation
- --text, -t <TEXT>
Text to synthesize.
- --control <INSTRUCTION>
Control instruction for voice design or style control (e.g. "warm female voice"). Cannot be used together with --prompt-text.
- --cfg-value <FLOAT>
CFG guidance scale. Default: 2.0. Recommended: 1.0–3.0.
- --inference-timesteps <INT>
Number of diffusion steps. Default: 10. Recommended: 4–30.
- --normalize
Enable text normalization (expand numbers, dates, etc.).
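In the Python API, a control instruction is expressed by prepending it to the text in parentheses, as in "(warm female voice)Hello". A small helper can build that string from separate text and control values, in the spirit of --text and --control. `apply_control` is a hypothetical name, and whether the CLI composes its input in exactly this way is an assumption; only the parenthesized format itself comes from the description of the text parameter.

```python
def apply_control(text, control=None):
    """Prefix `text` with a parenthesized control instruction, if one is given."""
    if not control:
        return text
    return f"({control.strip()}){text}"
```

The result can be passed as the text argument of generate(), for example `model.generate(text=apply_control("Hello", "warm female voice"))`.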
Prompt & Reference Audio
- --prompt-audio, -pa <PATH>
Prompt audio file for continuation mode. Requires --prompt-text or --prompt-file.
- --prompt-text, -pt <TEXT>
Text transcript of the prompt audio.
- --prompt-file <PATH>
Text file containing the prompt transcript (alternative to --prompt-text).
- --reference-audio, -ra <PATH>
Reference audio for isolated voice cloning (VoxCPM 2 only).
- --denoise
Denoise prompt/reference audio with ZipEnhancer before generation.
Model Loading
- --model-path <PATH>
Local model directory. If set, --hf-model-id is ignored.
- --hf-model-id <ID>
Hugging Face repo id. Default: openbmb/VoxCPM2.
- --device <DEVICE>
Runtime device selection. Supported values are auto, cpu, mps, cuda, and indexed CUDA devices such as cuda:0. auto prefers cuda -> mps -> cpu. Explicit values raise an error instead of silently falling back.
- --cache-dir <PATH>
Cache directory for Hub downloads.
- --local-files-only
Only use local files; do not download from the Hub.
- --no-denoiser
Skip loading the denoiser model.
- --no-optimize
Disable torch.compile acceleration. This is commonly useful for debugging, unsupported platforms, and non-CUDA environments.
- --zipenhancer-path <PATH>
Custom ZipEnhancer model id or local path.
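The precedence between --model-path and --hf-model-id can be mirrored in Python when choosing between the VoxCPM constructor and from_pretrained(). The sketch below is one way to express it. `select_model_source` and `load_model` are hypothetical helpers, and the import path `from voxcpm import VoxCPM` is an assumption, since this reference does not show the import.

```python
DEFAULT_HF_MODEL_ID = "openbmb/VoxCPM2"


def select_model_source(model_path=None, hf_model_id=DEFAULT_HF_MODEL_ID):
    """Return ("local", path) or ("hub", repo_id); a local path takes precedence."""
    if model_path:
        return ("local", model_path)
    return ("hub", hf_model_id)


def load_model(model_path=None, hf_model_id=DEFAULT_HF_MODEL_ID, device="auto"):
    """Load from a local directory if given, otherwise from the Hub."""
    # Imported lazily so the selection logic works without the package installed.
    # Assumption: the class is importable as `from voxcpm import VoxCPM`.
    from voxcpm import VoxCPM

    kind, ref = select_model_source(model_path, hf_model_id)
    if kind == "local":
        return VoxCPM(voxcpm_model_path=ref, device=device)
    return VoxCPM.from_pretrained(ref, device=device)
```

Keeping the selection in a separate pure function makes the precedence rule easy to check without downloading or loading any weights.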
LoRA
- --lora-path <PATH>
Path to LoRA weights directory.
- --lora-r <INT>
LoRA rank. Default: 32.
- --lora-alpha <INT>
LoRA alpha (scaling = alpha / r). Default: 16.
- --lora-dropout <FLOAT>
LoRA dropout rate (0.0–1.0). Default: 0.0.
- --lora-disable-lm
Disable LoRA on LM layers.
- --lora-disable-dit
Disable LoRA on DiT layers.
- --lora-enable-proj
Enable LoRA on projection layers.
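The relationship scaling = alpha / r determines how strongly the low-rank update contributes: with the defaults above (alpha 16, rank 32) the scaling is 0.5. The sketch below shows the standard LoRA update, scaling * B(Ax), on plain nested lists. It illustrates the general technique, not VoxCPM's layers; `lora_scaling` and `lora_delta` are hypothetical helpers, and the rank is taken from the number of rows in A.

```python
def lora_scaling(alpha, r):
    """Scaling factor applied to the low-rank update."""
    if r <= 0:
        raise ValueError("LoRA rank r must be positive")
    return alpha / r


def lora_delta(x, a, b, alpha):
    """Return scaling * B @ (A @ x) for nested-list matrices.

    a has shape (r, in_features); b has shape (out_features, r).
    """
    r = len(a)
    ax = [sum(w * xi for w, xi in zip(row, x)) for row in a]    # A @ x
    bax = [sum(w * v for w, v in zip(row, ax)) for row in b]    # B @ (A @ x)
    scale = lora_scaling(alpha, r)
    return [scale * v for v in bax]
```

When B is all zeros, as in the usual LoRA initialization, the update is zero and the layer behaves like the base model, which matches the description of unload_lora() leaving the layers in place with no effect.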
Note
The legacy flat CLI (voxcpm --text "..." --output out.wav) still works but is deprecated. Prefer the subcommand style.