Fine-Tuning Walkthrough: VoxCPM 2 on LibriSpeech
A complete walkthrough: data preparation → training → inference, using the publicly available LibriSpeech corpus. The same workflow applies to any comparable speech dataset — just swap the data-preparation step for your own source.
Prerequisites
Hardware
Rough estimates with batch_size=16 and max_batch_tokens=8192 on VoxCPM 2. Actual usage depends on audio length and accumulation steps.
| Setup | SFT (full fine-tuning) | LoRA |
|---|---|---|
| Single GPU | ~40 GB VRAM | ~20 GB VRAM |
| DDP — additional per-card overhead | +~10 GB | +~10 GB |
Note
DDP extra memory comes from a per-GPU gradient bucket for allreduce communication (≈ all trainable params × 4 bytes) plus NCCL buffers. If you hit OOM in DDP, reduce batch_size or max_batch_tokens.
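As a rough worked example of the gradient-bucket term in the note above, the sketch below applies the "trainable params × 4 bytes" rule. The function name and the 2-billion parameter count are illustrative assumptions, not VoxCPM 2's actual size, and NCCL buffers are not included.

```python
def ddp_bucket_gib(num_trainable_params: int, bytes_per_grad: int = 4) -> float:
    """Approximate per-GPU allreduce gradient bucket size in GiB (fp32 gradients)."""
    return num_trainable_params * bytes_per_grad / 2**30


# Hypothetical model size, for illustration only (not VoxCPM 2's real count).
params = 2_000_000_000
print(f"{params:,} trainable params -> ~{ddp_bucket_gib(params):.1f} GiB per GPU")
```

Plugging in the trainable-parameter count reported by your own run gives a quick estimate of how much headroom each card needs before NCCL buffers are added.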
Software
| Dependency | Requirement |
|---|---|
| Python | 3.10 or 3.11 (recommended for training) |
| PyTorch | 2.5.0+ (CUDA build matching your driver) |
| CUDA driver | 12.0+ |
| Disk space | ~30 GB for |
pip install -e .
Step 1 — Download LibriSpeech
# train-clean-100 (~6.3 GB compressed; FLAC is already compressed, so the extracted size is about the same)
wget https://www.openslr.org/resources/12/train-clean-100.tar.gz
tar -xzf train-clean-100.tar.gz
The extracted directory layout:
LibriSpeech/
└── train-clean-100/
    └── {speaker_id}/
        └── {chapter_id}/
            ├── {speaker_id}-{chapter_id}-{utt_id}.flac
            └── {speaker_id}-{chapter_id}.trans.txt   # "UTT_ID TRANSCRIPT" per line
Step 2 — Build the JSONL Manifest
The training script expects a JSONL manifest — one JSON object per line with at minimum an audio path and a text transcript.
Save the script below as scripts/prepare_librispeech_manifest.py and run it once:
import json
from pathlib import Path

LIBRISPEECH_ROOT = Path("/path/to/LibriSpeech/train-clean-100")
OUTPUT_PATH = Path("examples/librispeech_train.jsonl")
MAX_SAMPLES = 1000  # cap on the number of clips; set to None to use every utterance

entries = []
for trans_file in sorted(LIBRISPEECH_ROOT.rglob("*.trans.txt")):
    # Each transcript file sits next to the FLAC files it describes.
    speaker_chapter_dir = trans_file.parent
    with open(trans_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Format: "UTT_ID TRANSCRIPT"
            utt_id, text = line.split(" ", 1)
            audio_path = speaker_chapter_dir / f"{utt_id}.flac"
            if audio_path.exists():
                entries.append({"audio": str(audio_path), "text": text.capitalize()})
            if MAX_SAMPLES and len(entries) >= MAX_SAMPLES:
                break
    # Leave the outer loop as well once the cap is reached.
    if MAX_SAMPLES and len(entries) >= MAX_SAMPLES:
        break

OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)
with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
print(f"Wrote {len(entries):,} entries → {OUTPUT_PATH}")
python scripts/prepare_librispeech_manifest.py
The resulting manifest looks like:
{"audio": "/path/to/LibriSpeech/train-clean-100/103/1240/103-1240-0000.flac", "text": "Chapter one missus rachel lynde is surprised ..."}
{"audio": "/path/to/LibriSpeech/train-clean-100/103/1240/103-1240-0001.flac", "text": "That had its source away back in the woods of the old cuthbert place ..."}
Note
Why text.capitalize() instead of leaving ALL-CAPS? VoxCPM’s pre-training corpus uses sentence-cased text. Feeding ALL-CAPS transcripts can degrade text adherence at inference time. str.capitalize() is a simple heuristic; a proper truecasing model gives better results for production use.
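The configs in Step 3 reference examples/librispeech_val.jsonl, which the manifest script does not create. One way to produce it is to hold out a small random slice of the training manifest. This is a minimal sketch: the 5 % fraction, the seed, and the helper names are assumptions, and a speaker-disjoint split may be preferable for some use cases.

```python
import json
import random
from pathlib import Path

TRAIN_PATH = Path("examples/librispeech_train.jsonl")  # written by the Step 2 script
VAL_PATH = Path("examples/librispeech_val.jsonl")      # path used by the Step 3 configs
VAL_FRACTION = 0.05  # assumption: hold out 5 % of clips
SEED = 0


def split_entries(entries, val_fraction, seed):
    """Shuffle deterministically and return (train, val) lists with no overlap."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def write_jsonl(path, entries):
    """Write one JSON object per line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")


# Only touches files when run as a script and the train manifest exists.
if __name__ == "__main__" and TRAIN_PATH.exists():
    with open(TRAIN_PATH, encoding="utf-8") as f:
        entries = [json.loads(line) for line in f if line.strip()]
    train, val = split_entries(entries, VAL_FRACTION, SEED)
    write_jsonl(TRAIN_PATH, train)  # overwrite so the two sets stay disjoint
    write_jsonl(VAL_PATH, val)
    print(f"train: {len(train):,}  val: {len(val):,}")
```

Holding out clips shrinks the training set, so num_iters may need a small adjustment if you want exactly one epoch.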
Step 3a — Full Fine-Tuning (SFT)
Full fine-tuning updates all model parameters. Best for large datasets or significant domain shifts where LoRA capacity is insufficient.
Config file
Save as conf/librispeech_full.yaml:
pretrained_path: /path/to/VoxCPM2/ # directory with config.json + model.safetensors
train_manifest: examples/librispeech_train.jsonl
val_manifest: examples/librispeech_val.jsonl # strongly recommended — enables early stopping
sample_rate: 16000 # AudioVAE encoder input rate (NOT the 48kHz output rate)
out_sample_rate: 48000 # AudioVAE decoder output rate; only used at inference, not during training
batch_size: 2
grad_accum_steps: 8 # effective bs = batch_size × grad_accum_steps = 16
num_workers: 8
num_iters: 62 # ~1 epoch for 1,000 clips at effective bs=16; adjust per your dataset
max_steps: 62
log_interval: 10
valid_interval: 62
save_interval: 62
learning_rate: 1.0e-5 # ~10× smaller than LoRA to avoid catastrophic forgetting
weight_decay: 0.01
warmup_steps: 6 # ≈ 10 % of num_iters
max_batch_tokens: 8192 # filters out clips whose token count > max_batch_tokens // batch_size
save_path: checkpoints/librispeech_full
tensorboard: checkpoints/librispeech_full/logs
lambdas:
  loss/diff: 1.0
  loss/stop: 1.0
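The step counts in this config follow from simple arithmetic. The sketch below reproduces them; the helper name and the warmup_ratio argument are illustrative, and it assumes num_iters counts optimizer steps on a single GPU, as the config comments suggest.

```python
def plan_schedule(num_clips, batch_size, grad_accum_steps, epochs=1, warmup_ratio=0.1):
    """Derive effective batch size, total steps, and warm-up steps."""
    effective_bs = batch_size * grad_accum_steps
    # Integer division drops the final partial batch.
    num_iters = (num_clips * epochs) // effective_bs
    warmup_steps = round(num_iters * warmup_ratio)
    return {
        "effective_bs": effective_bs,
        "num_iters": num_iters,
        "warmup_steps": warmup_steps,
    }


print(plan_schedule(num_clips=1000, batch_size=2, grad_accum_steps=8))
# {'effective_bs': 16, 'num_iters': 62, 'warmup_steps': 6}
```

For a different dataset size or more epochs, rerun the helper and copy the results into num_iters, max_steps, and warmup_steps.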
Launch
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/librispeech_full.yaml
# Multi-GPU (4×)
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/librispeech_full.yaml
You can also fill in CONFIG_PATH, TRAIN_MANIFEST, and BATCH_SIZE at the top of run_train.sh and run bash run_train.sh.
Step 3b — LoRA Fine-Tuning
LoRA freezes the base model and trains only a small set of low-rank delta matrices. Recommended as the default starting point.
Config file
Save as conf/librispeech_lora.yaml:
pretrained_path: /path/to/VoxCPM2/
train_manifest: examples/librispeech_train.jsonl
val_manifest: examples/librispeech_val.jsonl
sample_rate: 16000 # AudioVAE encoder input rate (NOT the 48kHz output rate)
out_sample_rate: 48000 # AudioVAE decoder output rate; only used at inference, not during training
batch_size: 2
grad_accum_steps: 8 # effective bs = 16
num_workers: 8
num_iters: 62 # ~1 epoch for 1,000 clips at effective bs=16
max_steps: 62
log_interval: 10
valid_interval: 62
save_interval: 62
learning_rate: 1.0e-4
weight_decay: 0.01
warmup_steps: 6 # ≈ 10 % of num_iters
max_batch_tokens: 8192
save_path: checkpoints/librispeech_lora
tensorboard: checkpoints/librispeech_lora/logs
lambdas:
  loss/diff: 1.0
  loss/stop: 1.0
lora:
  enable_lm: true
  enable_dit: true     # critical for voice quality — do not disable
  enable_proj: false
  r: 8                 # r=8 for speaker adaptation; r=32–64 for new languages
  alpha: 16
  dropout: 0.0
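To give a feel for what r and alpha control, the sketch below uses the standard LoRA formulation, in which a frozen weight of shape (d_out, d_in) gets a low-rank update B·A scaled by alpha / r. That VoxCPM's implementation uses exactly this scaling is an assumption, and the 1024 × 1024 layer size is hypothetical.

```python
def lora_stats(d_in, d_out, r, alpha):
    """Parameter counts and update scale for one LoRA-adapted linear layer."""
    full_params = d_in * d_out        # frozen base weight
    lora_params = r * (d_in + d_out)  # A is (r, d_in), B is (d_out, r)
    return {
        "full_params": full_params,
        "lora_params": lora_params,
        "fraction": lora_params / full_params,
        "scale": alpha / r,
    }


# Hypothetical 1024 x 1024 layer; alpha is kept at 2 * r as in the config above.
for rank in (8, 32, 64):
    stats = lora_stats(d_in=1024, d_out=1024, r=rank, alpha=2 * rank)
    print(f"r={rank:>2}: {stats['lora_params']:,} trainable "
          f"({stats['fraction']:.2%} of the layer), scale={stats['scale']}")
```

Raising r increases adapter capacity linearly, which is why larger ranks are suggested for harder adaptations such as new languages.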
Launch
python scripts/train_voxcpm_finetune.py --config_path conf/librispeech_lora.yaml
Step 4 — Monitor Training
tensorboard --logdir checkpoints/librispeech_full/logs
# or
tensorboard --logdir checkpoints/librispeech_lora/logs
Metrics to watch
| Metric | Healthy pattern |
|---|---|
|  | Decreases steadily; flattens as convergence approaches |
|  | Drops quickly in the first 100–200 steps, then stays low |
|  | Stays roughly in the 0.3–2.0 range; occasional spikes are fine |
|  | Cosine warm-up then decay |
|  | Tracks training loss; stop if it rises while train loss keeps falling |
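For reference when reading the learning-rate curve, a warm-up followed by cosine decay can be sketched as below. This assumes linear warm-up and decay to zero; the scheduler in the training script may differ in details, and the function name and arguments are illustrative.

```python
import math


def lr_at(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a given step: linear warm-up, then cosine decay."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr + (base_lr - min_lr) * cosine


# Values taken from the LoRA config: learning_rate=1e-4, warmup_steps=6, num_iters=62.
for step in (0, 3, 6, 34, 62):
    print(step, f"{lr_at(step, 1.0e-4, 6, 62):.2e}")
```

A logged curve that rises briefly, peaks near the configured learning rate, and then falls smoothly matches this shape.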
When to stop
1–2 epochs is almost always enough for TTS fine-tuning. The best checkpoint is often not the final one.
Use valid_interval: 50 and save_interval: 50 for rollback options.
Pick the checkpoint where val/loss was lowest.
If you do not have a val manifest, evaluate a handful of checkpoints in the convergence zone with the inference script and pick the best-sounding one.
Warning
If val/loss starts rising while train/loss keeps falling, stop immediately and roll back. This is the classic overfitting signature for TTS models: the model will start ignoring input text and generate the same voice pattern regardless of what you type.
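The stopping rule can be expressed as a small check over logged losses. This is a minimal sketch, assuming you have exported (step, train_loss, val_loss) triples from TensorBoard or your own logs; the function names, the patience parameter, and the numbers are made up for illustration.

```python
def best_checkpoint(history):
    """Return the step whose validation loss is lowest."""
    return min(history, key=lambda row: row[2])[0]


def is_overfitting(history, patience=2):
    """True if val loss rose on each of the last `patience` evaluations
    while train loss fell on each of them."""
    if len(history) <= patience:
        return False
    recent = history[-(patience + 1):]
    pairs = list(zip(recent, recent[1:]))
    val_rising = all(later[2] > earlier[2] for earlier, later in pairs)
    train_falling = all(later[1] < earlier[1] for earlier, later in pairs)
    return val_rising and train_falling


# Made-up (step, train_loss, val_loss) values, not real training results.
history = [
    (50, 1.20, 1.25),
    (100, 1.05, 1.12),
    (150, 0.95, 1.10),
    (200, 0.88, 1.14),
    (250, 0.80, 1.21),
]
print(best_checkpoint(history))  # 150
print(is_overfitting(history))   # True
```

Requiring more than one consecutive rise avoids reacting to a single noisy evaluation.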
Step 5 — Inference
SFT checkpoint
# Standard TTS
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir checkpoints/librispeech_full/latest \
    --text "She walked slowly along the quiet avenue, listening to the wind." \
    --output output_full.wav
# Voice cloning (pass a reference clip and its exact transcript)
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir checkpoints/librispeech_full/latest \
    --text "She walked slowly along the quiet avenue, listening to the wind." \
    --prompt_audio examples/reference_speaker.wav \
    --prompt_text "Exact transcript of the reference audio." \
    --output output_full_cloned.wav
LoRA checkpoint
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt checkpoints/librispeech_lora/latest \
    --text "She walked slowly along the quiet avenue, listening to the wind." \
    --output output_lora.wav
To batch-evaluate and compare multiple checkpoints:
for ckpt in checkpoints/librispeech_lora/step_*/; do
    python scripts/test_voxcpm_lora_infer.py \
        --lora_ckpt "$ckpt" \
        --text "Evaluation sentence." \
        --output "eval_$(basename "$ckpt").wav"
done
Troubleshooting
Out-of-memory (OOM)
LibriSpeech clips vary in duration (2 s – 35 s). max_batch_tokens already filters the longest ones. If OOM persists, try:
# Option 1 — smaller batch with same effective size (1 × 16 = 16; the walkthrough configs start from batch_size: 2)
batch_size: 1
grad_accum_steps: 16
# Option 2 — tighter token budget
max_batch_tokens: 4096
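Per the config comment in Step 3a, a clip is dropped when its token count exceeds max_batch_tokens // batch_size, so both settings affect which clips survive. The sketch below applies that rule to a few hypothetical token counts; the helper names and the counts themselves are illustrative, and how the dataloader actually measures tokens is not shown here.

```python
def per_clip_token_limit(max_batch_tokens, batch_size):
    """Largest clip (in tokens) that passes the filter."""
    return max_batch_tokens // batch_size


def keep_clips(token_counts, max_batch_tokens, batch_size):
    """Return the token counts that survive the length filter."""
    limit = per_clip_token_limit(max_batch_tokens, batch_size)
    return [n for n in token_counts if n <= limit]


clips = [500, 1800, 2500, 4000, 5000]  # hypothetical token counts

print(keep_clips(clips, max_batch_tokens=8192, batch_size=2))  # [500, 1800, 2500, 4000]
print(keep_clips(clips, max_batch_tokens=4096, batch_size=2))  # [500, 1800]
print(keep_clips(clips, max_batch_tokens=8192, batch_size=1))  # [500, 1800, 2500, 4000, 5000]
```

Under this rule, lowering batch_size alone raises the per-clip limit, so pairing it with a smaller max_batch_tokens may be needed if long clips are what triggers the OOM.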
Loss does not decrease
Verify that audio paths in the manifest are correct and all files are readable.
LibriSpeech FLAC files are 16 kHz; keep sample_rate: 16000, which matches the AudioVAE encoder input rate. The dataloader resamples automatically.
Check that transcripts are sentence-cased, not ALL-CAPS.
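The path and casing checks above can be automated with a short script. This is a sketch under assumptions: the function names are hypothetical, the manifest keys are the audio and text fields from Step 2, and it only checks that files exist and are readable, not that the audio decodes.

```python
import json
import os

MANIFEST = "examples/librispeech_train.jsonl"


def _readable(path):
    """True if the path is an existing file the current user can read."""
    return os.path.isfile(path) and os.access(path, os.R_OK)


def find_problems(lines, file_ok=_readable):
    """Yield (line_number, message) for each problem in JSONL manifest lines."""
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            yield lineno, "not valid JSON"
            continue
        if not isinstance(entry, dict):
            yield lineno, "not a JSON object"
            continue
        audio, text = entry.get("audio"), entry.get("text")
        if not isinstance(audio, str) or not file_ok(audio):
            yield lineno, f"audio missing or unreadable: {audio!r}"
        if not isinstance(text, str) or not text.strip():
            yield lineno, "text missing or empty"
        elif text.isupper():
            yield lineno, "text is ALL-CAPS"


# Only reads the manifest when run as a script and the file exists.
if __name__ == "__main__" and os.path.isfile(MANIFEST):
    with open(MANIFEST, encoding="utf-8") as f:
        for lineno, message in find_problems(f):
            print(f"line {lineno}: {message}")
```

An empty output means every entry passed these basic checks.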
Generated audio ignores input text
Classic overfitting symptom. Roll back to an earlier checkpoint:
ls checkpoints/librispeech_full/   # find a step before divergence
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir checkpoints/librispeech_full/step_0001000 \
    --text "Test sentence." \
    --output test.wav
For future runs: always provide a val_manifest, use valid_interval: 50, and stop when val/loss turns upward. Keeping training within 1–3 epochs generally avoids this problem entirely.