Fine-Tuning Walkthrough: VoxCPM 2 on LibriSpeech

A complete walkthrough: data preparation → training → inference, using the publicly available LibriSpeech corpus. The same workflow applies to any comparable speech dataset — just swap the data-preparation step for your own source.


Prerequisites

Hardware

Rough estimates with batch_size=16 and max_batch_tokens=8192 on VoxCPM 2. Actual usage depends on audio length and accumulation steps.

Setup

SFT (full fine-tuning)

LoRA

Single GPU

~40 GB VRAM

~20 GB VRAM

DDP — additional per-card overhead

+~10 GB

+~10 GB

Note

DDP extra memory comes from a per-GPU gradient bucket for allreduce communication (≈ all trainable params × 4 bytes) plus NCCL buffers. If you hit OOM in DDP, reduce batch_size or max_batch_tokens.

Software

Dependency

Requirement

Python

3.10 or 3.11 (recommended for training)

PyTorch

2.5.0+ (CUDA build matching your driver)

CUDA driver

12.0+

Disk space

~30 GB for train-clean-100; ~5 GB for checkpoints

pip install -e .

Step 1 — Download LibriSpeech

# train-clean-100  (~6.3 GB compressed, ~30 GB extracted)
wget https://www.openslr.org/resources/12/train-clean-100.tar.gz
tar -xzf train-clean-100.tar.gz

The extracted directory layout:

LibriSpeech/
└── train-clean-100/
    └── {speaker_id}/
        └── {chapter_id}/
            ├── {speaker_id}-{chapter_id}-{utt_id}.flac
            └── {speaker_id}-{chapter_id}.trans.txt   # "UTT_ID TRANSCRIPT" per line

Step 2 — Build the JSONL Manifest

The training script expects a JSONL manifest — one JSON object per line with at minimum an audio path and a text transcript.

Save the script below as scripts/prepare_librispeech_manifest.py and run it once:

import json
from pathlib import Path

LIBRISPEECH_ROOT = Path("/path/to/LibriSpeech/train-clean-100")
OUTPUT_PATH      = Path("examples/librispeech_train.jsonl")
MAX_SAMPLES      = 1000

entries = []
for trans_file in sorted(LIBRISPEECH_ROOT.rglob("*.trans.txt")):
    speaker_chapter_dir = trans_file.parent
    with open(trans_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            utt_id, text = line.split(" ", 1)
            audio_path = speaker_chapter_dir / f"{utt_id}.flac"
            if audio_path.exists():
                entries.append({"audio": str(audio_path), "text": text.capitalize()})
            if MAX_SAMPLES and len(entries) >= MAX_SAMPLES:
                break
    if MAX_SAMPLES and len(entries) >= MAX_SAMPLES:
        break

OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)
with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

print(f"Wrote {len(entries):,} entries → {OUTPUT_PATH}")
python scripts/prepare_librispeech_manifest.py

The resulting manifest looks like:

{"audio": "/path/to/LibriSpeech/train-clean-100/103/1240/103-1240-0000.flac", "text": "Chapter one missus rachel lynde is surprised ..."}
{"audio": "/path/to/LibriSpeech/train-clean-100/103/1240/103-1240-0001.flac", "text": "That had its source away back in the woods of the old cuthbert place ..."}

Note

Why text.capitalize() instead of leaving ALL-CAPS? VoxCPM’s pre-training corpus uses sentence-cased text. Feeding ALL-CAPS transcripts can degrade text adherence at inference time. str.capitalize() is a simple heuristic; a proper truecasing model gives better results for production use.


Step 3a — Full Fine-Tuning (SFT)

Full fine-tuning updates all model parameters. Best for large datasets or significant domain shifts where LoRA capacity is insufficient.

Config file

Save as conf/librispeech_full.yaml:

pretrained_path: /path/to/VoxCPM2/             # directory with config.json + model.safetensors
train_manifest:  examples/librispeech_train.jsonl
val_manifest:    examples/librispeech_val.jsonl  # strongly recommended — enables early stopping

sample_rate:        16000   # AudioVAE encoder input rate (NOT the 48kHz output rate)
out_sample_rate:    48000   # AudioVAE decoder output rate; only used at inference, not during training
batch_size:         2
grad_accum_steps:   8       # effective bs = batch_size × grad_accum_steps = 16
num_workers:        8

num_iters:          62      # ~1 epoch for 1,000 clips at effective bs=16; adjust per your dataset
max_steps:          62
log_interval:       10
valid_interval:     62
save_interval:      62

learning_rate:  1.0e-5      # ~10× smaller than LoRA to avoid catastrophic forgetting
weight_decay:   0.01
warmup_steps:   6           # ≈ 10 % of num_iters
max_batch_tokens: 8192      # filters out clips whose token count > max_batch_tokens // batch_size

save_path:   checkpoints/librispeech_full
tensorboard: checkpoints/librispeech_full/logs

lambdas:
  loss/diff: 1.0
  loss/stop: 1.0

Launch

# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/librispeech_full.yaml

# Multi-GPU (4×)
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/librispeech_full.yaml

You can also fill in CONFIG_PATH, TRAIN_MANIFEST, and BATCH_SIZE at the top of run_train.sh and run bash run_train.sh.


Step 3b — LoRA Fine-Tuning

LoRA freezes the base model and trains only a small set of low-rank delta matrices. Recommended as the default starting point.

Config file

Save as conf/librispeech_lora.yaml:

pretrained_path: /path/to/VoxCPM2/
train_manifest:  examples/librispeech_train.jsonl
val_manifest:    examples/librispeech_val.jsonl

sample_rate:        16000   # AudioVAE encoder input rate (NOT the 48kHz output rate)
out_sample_rate:    48000   # AudioVAE decoder output rate; only used at inference, not during training
batch_size:         2
grad_accum_steps:   8       # effective bs = 16
num_workers:        8

num_iters:          62      # ~1 epoch for 1,000 clips at effective bs=16
max_steps:          62
log_interval:       10
valid_interval:     62
save_interval:      62

learning_rate:  1.0e-4
weight_decay:   0.01
warmup_steps:   6           # ≈ 10 % of num_iters
max_batch_tokens: 8192

save_path:   checkpoints/librispeech_lora
tensorboard: checkpoints/librispeech_lora/logs

lambdas:
  loss/diff: 1.0
  loss/stop: 1.0

lora:
  enable_lm:   true
  enable_dit:  true    # critical for voice quality — do not disable
  enable_proj: false
  r:     8             # r=8 for speaker adaptation; r=32–64 for new languages
  alpha: 16
  dropout: 0.0

Launch

python scripts/train_voxcpm_finetune.py --config_path conf/librispeech_lora.yaml

Step 4 — Monitor Training

tensorboard --logdir checkpoints/librispeech_full/logs
# or
tensorboard --logdir checkpoints/librispeech_lora/logs

Metrics to watch

Metric

Healthy pattern

loss/diff

Decreases steadily; flattens as convergence approaches

loss/stop

Drops quickly in the first 100–200 steps, then stays low

grad_norm

Stays roughly in the 0.3–2.0 range; occasional spikes are fine

lr

Cosine warm-up then decay

val/loss

Tracks training loss; stop if it rises while train loss keeps falling

When to stop

1–2 epochs is almost always enough for TTS fine-tuning. The best checkpoint is often not the final one.

  • Use valid_interval: 50 and save_interval: 50 for rollback options.

  • Pick the checkpoint where val/loss was lowest.

  • If you do not have a val manifest, evaluate a handful of checkpoints in the convergence zone with the inference script and pick the best-sounding one.

Warning

If val/loss starts rising while train/loss keeps falling, stop immediately and roll back. This is the classic overfitting signature for TTS models: the model will start ignoring input text and generate the same voice pattern regardless of what you type.


Step 5 — Inference

SFT checkpoint

# Standard TTS
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir checkpoints/librispeech_full/latest \
    --text "She walked slowly along the quiet avenue, listening to the wind." \
    --output output_full.wav

# Voice cloning (pass a reference clip and its exact transcript)
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir checkpoints/librispeech_full/latest \
    --text "She walked slowly along the quiet avenue, listening to the wind." \
    --prompt_audio examples/reference_speaker.wav \
    --prompt_text  "Exact transcript of the reference audio." \
    --output output_full_cloned.wav

LoRA checkpoint

python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt checkpoints/librispeech_lora/latest \
    --text "She walked slowly along the quiet avenue, listening to the wind." \
    --output output_lora.wav

To batch-evaluate and compare multiple checkpoints:

for ckpt in checkpoints/librispeech_lora/step_*/; do
    python scripts/test_voxcpm_lora_infer.py \
        --lora_ckpt "$ckpt" \
        --text "Evaluation sentence." \
        --output "eval_$(basename $ckpt).wav"
done

Troubleshooting

Out-of-memory (OOM)

LibriSpeech clips vary in duration (2 s – 35 s). max_batch_tokens already filters the longest ones. If OOM persists, try:

# Option 1 — smaller batch with same effective size
batch_size:       8
grad_accum_steps: 2

# Option 2 — tighter token budget
max_batch_tokens: 4096

Loss does not decrease

  • Verify that audio paths in the manifest are correct and all files are readable.

  • LibriSpeech FLAC files are 16 kHz; keep sample_rate: 16000 — this matches the AudioVAE encoder input rate. The dataloader resamples automatically.

  • Check that transcripts are sentence-cased, not ALL-CAPS.

Generated audio ignores input text

Classic overfitting symptom. Roll back to an earlier checkpoint:

ls checkpoints/librispeech_full/   # find a step before divergence

python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir checkpoints/librispeech_full/step_0001000 \
    --text "Test sentence." \
    --output test.wav

For future runs: always provide a val_manifest, use valid_interval: 50, and stop when val/loss turns upward. Keeping training within 1–3 epochs generally avoids this problem entirely.