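Fine-Tuning Walkthrough: Fine-Tuning VoxCPM 2 on LibriSpeech

An end-to-end walkthrough, from data preparation through fine-tuning to inference, using the public LibriSpeech corpus. The same pipeline applies to comparable speech datasets: only the data preparation step needs to change to match your data source.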

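Prerequisites

Hardware requirements

Rough estimates for VoxCPM 2 at batch_size=16 and max_batch_tokens=8192; actual usage depends on audio length and the number of gradient accumulation steps.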

Setup	SFT (full fine-tuning)	LoRA
Single GPU	~40 GB VRAM	~20 GB VRAM
DDP, extra per-GPU overhead	+~10 GB	+~10 GB

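Note

The extra DDP memory comes from the per-GPU gradient buckets used for allreduce (roughly trainable parameters × 4 bytes) plus NCCL buffers. If you hit OOM under DDP, lower batch_size or max_batch_tokens.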
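
The rule of thumb in this note can be turned into a quick back-of-the-envelope calculation. The sketch below assumes fp32 gradients (4 bytes each) and ignores NCCL buffers; ddp_grad_bucket_gib is a hypothetical helper, and the parameter counts are placeholders, not VoxCPM 2's actual sizes.

```python
def ddp_grad_bucket_gib(trainable_params: int, bytes_per_grad: int = 4) -> float:
    """Approximate per-GPU memory for DDP gradient buckets, in GiB."""
    return trainable_params * bytes_per_grad / 2**30


# Hypothetical parameter counts, for illustration only.
for label, n_params in [("2B trainable params", 2_000_000_000),
                        ("20M trainable params", 20_000_000)]:
    print(f"{label}: ~{ddp_grad_bucket_gib(n_params):.2f} GiB of gradient buckets")
```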

Software requirements

Dependency	Requirement
Python	3.10 or 3.11 (recommended for training)
PyTorch	2.5.0+ (a CUDA build that matches your driver)
CUDA driver	12.0+
Disk space	~30 GB for `train-clean-100`; ~5 GB for checkpoints

pip install -e .

Step 1: Download LibriSpeech

# train-clean-100  (~6.3 GB compressed, ~30 GB extracted)
wget https://www.openslr.org/resources/12/train-clean-100.tar.gz
tar -xzf train-clean-100.tar.gz

Directory layout after extraction:

LibriSpeech/
└── train-clean-100/
    └── {speaker_id}/
        └── {chapter_id}/
            ├── {speaker_id}-{chapter_id}-{utt_id}.flac
            └── {speaker_id}-{chapter_id}.trans.txt   # "UTT_ID TRANSCRIPT" per line

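Step 2: Generate the JSONL manifest

The training script expects JSONL manifests: one JSON object per line, containing at least an audio path and a text transcript.

Save the script below as scripts/prepare_librispeech_manifest.py and run it once: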

import json
from pathlib import Path

LIBRISPEECH_ROOT = Path("/path/to/LibriSpeech/train-clean-100")
OUTPUT_PATH      = Path("examples/librispeech_train.jsonl")
MAX_SAMPLES      = 1000

entries = []
for trans_file in sorted(LIBRISPEECH_ROOT.rglob("*.trans.txt")):
    speaker_chapter_dir = trans_file.parent
    with open(trans_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            utt_id, text = line.split(" ", 1)
            audio_path = speaker_chapter_dir / f"{utt_id}.flac"
            if audio_path.exists():
                entries.append({"audio": str(audio_path), "text": text.capitalize()})
            if MAX_SAMPLES and len(entries) >= MAX_SAMPLES:
                break
    if MAX_SAMPLES and len(entries) >= MAX_SAMPLES:
        break

OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)
with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

print(f"Wrote {len(entries):,} entries → {OUTPUT_PATH}")

python scripts/prepare_librispeech_manifest.py

The resulting manifest looks like this:

{"audio": "/path/to/LibriSpeech/train-clean-100/103/1240/103-1240-0000.flac", "text": "Chapter one missus rachel lynde is surprised ..."}
{"audio": "/path/to/LibriSpeech/train-clean-100/103/1240/103-1240-0001.flac", "text": "That had its source away back in the woods of the old cuthbert place ..."}
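
The configs in Step 3 also reference examples/librispeech_val.jsonl, which the manifest script does not create. One way to obtain it is to hold out a small slice of the training manifest. The sketch below is an assumption rather than part of the VoxCPM tooling: the split_manifest helper, the 5% ratio, and the fixed seed are all illustrative choices.

```python
import random
from pathlib import Path


def split_manifest(lines, val_fraction=0.05, seed=0):
    """Shuffle manifest lines and hold out a fraction for validation."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)
    n_val = max(1, int(len(lines) * val_fraction))
    return lines[n_val:], lines[:n_val]


if __name__ == "__main__":
    manifest = Path("examples/librispeech_train.jsonl")
    if manifest.exists():
        with open(manifest, encoding="utf-8") as f:
            all_lines = [ln for ln in f if ln.strip()]
        train, val = split_manifest(all_lines)
        # Overwrite the training manifest so the two files do not overlap.
        manifest.write_text("".join(train), encoding="utf-8")
        Path("examples/librispeech_val.jsonl").write_text("".join(val), encoding="utf-8")
        print(f"train: {len(train):,}  val: {len(val):,}")
```

A random split keeps the same speakers in both files, so the validation loss measures fit to seen speakers rather than generalization to new ones. Holding out clips also shrinks the training set, which changes the step count for one epoch.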

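Note

Why text.capitalize() instead of keeping the all-caps transcripts? The VoxCPM pre-training corpus uses sentence case, so feeding all-caps transcripts at inference time can hurt how closely the speech follows the text. str.capitalize() is only a simple heuristic; in production, a dedicated truecasing model will give better results.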
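
To make the limitation concrete, here is what str.capitalize() does to a LibriSpeech-style transcript: it upper-cases the first character and lower-cases everything else, so proper nouns lose their capitals. The sample string is the all-caps form of the first manifest entry shown in this step.

```python
raw = "CHAPTER ONE MISSUS RACHEL LYNDE IS SURPRISED"
sentence_case = raw.capitalize()
print(sentence_case)
# Chapter one missus rachel lynde is surprised
```

The name "Rachel Lynde" comes out in lower case, which is the kind of error a truecasing model would avoid.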

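Step 3a: Full fine-tuning (SFT)

Full fine-tuning updates all parameters. It suits large datasets, or a pronounced domain shift where LoRA lacks the capacity.

Config file

Save as conf/librispeech_full.yaml: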

pretrained_path: /path/to/VoxCPM2/             # directory with config.json + model.safetensors
train_manifest:  examples/librispeech_train.jsonl
val_manifest:    examples/librispeech_val.jsonl  # strongly recommended — enables early stopping

sample_rate:        16000   # AudioVAE encoder input rate (NOT the 48kHz output rate)
out_sample_rate:    48000   # AudioVAE decoder output rate; only used at inference, not during training
batch_size:         2
grad_accum_steps:   8       # effective bs = batch_size × grad_accum_steps = 16
num_workers:        8

num_iters:          62      # ~1 epoch for 1,000 clips at effective bs=16; adjust per your dataset
max_steps:          62
log_interval:       10
valid_interval:     62
save_interval:      62

learning_rate:  1.0e-5      # ~10× smaller than LoRA to avoid catastrophic forgetting
weight_decay:   0.01
warmup_steps:   6           # ≈ 10 % of num_iters
max_batch_tokens: 8192      # filters out clips whose token count > max_batch_tokens // batch_size

save_path:   checkpoints/librispeech_full
tensorboard: checkpoints/librispeech_full/logs

lambdas:
  loss/diff: 1.0
  loss/stop: 1.0
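
The num_iters and warmup_steps values in this config follow from simple arithmetic on the dataset size. A small sketch of that calculation, where steps_per_epoch and warmup_for are hypothetical helper names and the 10% warm-up ratio is taken from the config comment:

```python
def steps_per_epoch(num_clips: int, batch_size: int, grad_accum_steps: int) -> int:
    """Optimizer steps needed to see every clip once (partial last batch dropped)."""
    return num_clips // (batch_size * grad_accum_steps)


def warmup_for(num_iters: int, ratio: float = 0.10) -> int:
    """Warm-up steps as a fraction of the total step count."""
    return round(num_iters * ratio)


iters = steps_per_epoch(num_clips=1000, batch_size=2, grad_accum_steps=8)
print(iters, warmup_for(iters))  # 62 6
```

Clips dropped by the max_batch_tokens filter reduce the real clip count, so treat the result as an upper bound.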

Launch command

# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/librispeech_full.yaml

# Multi-GPU (4×)
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/librispeech_full.yaml

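Alternatively, fill in CONFIG_PATH, TRAIN_MANIFEST, and BATCH_SIZE at the top of run_train.sh, then run bash run_train.sh.

Step 3b: LoRA fine-tuning

LoRA freezes the base model and trains only a small number of low-rank delta matrices. It is the recommended default starting point.

Config file

Save as conf/librispeech_lora.yaml: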

pretrained_path: /path/to/VoxCPM2/
train_manifest:  examples/librispeech_train.jsonl
val_manifest:    examples/librispeech_val.jsonl

sample_rate:        16000   # AudioVAE encoder input rate (NOT the 48kHz output rate)
out_sample_rate:    48000   # AudioVAE decoder output rate; only used at inference, not during training
batch_size:         2
grad_accum_steps:   8       # effective bs = 16
num_workers:        8

num_iters:          62      # ~1 epoch for 1,000 clips at effective bs=16
max_steps:          62
log_interval:       10
valid_interval:     62
save_interval:      62

learning_rate:  1.0e-4
weight_decay:   0.01
warmup_steps:   6           # ≈ 10 % of num_iters
max_batch_tokens: 8192

save_path:   checkpoints/librispeech_lora
tensorboard: checkpoints/librispeech_lora/logs

lambdas:
  loss/diff: 1.0
  loss/stop: 1.0

lora:
  enable_lm:   true
  enable_dit:  true    # critical for voice quality — do not disable
  enable_proj: false
  r:     8             # r=8 for speaker adaptation; r=32–64 for new languages
  alpha: 16
  dropout: 0.0
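
To get a feel for what r and alpha control, the sketch below counts the parameters a rank-r adapter adds to one linear layer and computes the scaling factor alpha / r used in the standard LoRA formulation. The 1024×1024 layer size is a hypothetical example, not a VoxCPM 2 dimension, and the VoxCPM implementation may scale its adapters differently.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(alpha: float, r: int) -> float:
    """Multiplier applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


d_in = d_out = 1024  # hypothetical layer size
full = d_in * d_out
for r in (8, 32, 64):
    extra = lora_extra_params(d_in, d_out, r)
    print(f"r={r:>2}: {extra:,} adapter params ({extra / full:.1%} of the frozen weight)")
print("scaling for r=8, alpha=16:", lora_scaling(16, 8))
```

The adapter size grows linearly with r, which is why higher ranks are reserved for larger shifts such as a new language.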

Launch command

python scripts/train_voxcpm_finetune.py --config_path conf/librispeech_lora.yaml

Step 4: Monitor fine-tuning

tensorboard --logdir checkpoints/librispeech_full/logs
# or
tensorboard --logdir checkpoints/librispeech_lora/logs

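Which metrics to watch

Metric	Healthy behavior
`loss/diff`	Decreases steadily; flattens out near convergence
`loss/stop`	Drops quickly over the first 100–200 steps, then stays low
`grad_norm`	Roughly within 0.3–2.0; occasional spikes are acceptable
`lr`	Warm-up, then cosine decay
`val/loss`	Tracks the training loss; stop if it rises while the training loss keeps falling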

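When to stop training

One to two epochs are usually enough for TTS fine-tuning. The best checkpoint is often not the last one.

Set valid_interval: 50 and save_interval: 50 so you can roll back easily.
Pick the checkpoint with the lowest val/loss.
Without a validation manifest, take several checkpoints from the convergence region, listen to them with the inference script, and keep the one that sounds best.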
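
Choosing the checkpoint with the lowest val/loss is easy to automate once the validation losses are collected. The sketch below assumes you have already gathered (step, val_loss) pairs, for example by reading them off TensorBoard. The numbers are made up for illustration, and the step_XXXXXXX directory naming is inferred from the rollback example in the troubleshooting section.

```python
def best_checkpoint(history):
    """Return the (step, val_loss) pair with the lowest validation loss."""
    return min(history, key=lambda pair: pair[1])


# Hypothetical validation history: (step, val/loss).
history = [(50, 0.92), (100, 0.81), (150, 0.78), (200, 0.83), (250, 0.90)]
step, loss = best_checkpoint(history)
print(f"best: step_{step:07d} (val/loss={loss})")
```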

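Warning

If val/loss starts rising while train/loss keeps falling, stop immediately and roll back. This is the typical overfitting signature for TTS models: the model ignores the input text and produces a similar vocal pattern regardless of the input.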
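
The stopping rule in this warning can be expressed as a simple check over recent evaluations. This is a minimal sketch, not a feature of the training script: the is_overfitting helper and the patience of two consecutive evaluations are assumptions, and the loss values are invented.

```python
def is_overfitting(train_losses, val_losses, patience=2):
    """True if val loss rose while train loss fell for `patience` consecutive evals."""
    if len(train_losses) != len(val_losses):
        raise ValueError("need one train loss per validation loss")
    if len(val_losses) < patience + 1:
        return False
    recent = range(len(val_losses) - patience, len(val_losses))
    return all(
        val_losses[i] > val_losses[i - 1] and train_losses[i] < train_losses[i - 1]
        for i in recent
    )


train = [1.10, 0.95, 0.84, 0.76, 0.70]
val = [1.12, 0.99, 0.93, 0.96, 1.01]
print(is_overfitting(train, val))  # True
```

Requiring more than one rising evaluation avoids reacting to a single noisy validation point; a stricter setup would use patience=1.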

Step 5: Verify with inference

Inference from an SFT checkpoint

# Standard TTS
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir checkpoints/librispeech_full/latest \
    --text "She walked slowly along the quiet avenue, listening to the wind." \
    --output output_full.wav

# Voice cloning (pass a reference clip and its exact transcript)
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir checkpoints/librispeech_full/latest \
    --text "She walked slowly along the quiet avenue, listening to the wind." \
    --prompt_audio examples/reference_speaker.wav \
    --prompt_text  "Exact transcript of the reference audio." \
    --output output_full_cloned.wav

Inference from a LoRA checkpoint

python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt checkpoints/librispeech_lora/latest \
    --text "She walked slowly along the quiet avenue, listening to the wind." \
    --output output_lora.wav

To batch-evaluate and compare multiple checkpoints:

for ckpt in checkpoints/librispeech_lora/step_*/; do
    python scripts/test_voxcpm_lora_infer.py \
        --lora_ckpt "$ckpt" \
        --text "Evaluation sentence." \
        --output "eval_$(basename "$ckpt").wav"
done

Troubleshooting

Out of memory (OOM)

LibriSpeech clips vary in length (2–35 s). max_batch_tokens filters out the longest ones. If you still hit OOM, try:

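# Option 1: smaller per-step batch, same effective size (8 × 2 = 16).
# Only helps if you started from batch_size: 16; the configs above already use batch_size: 2.
batch_size:       8
grad_accum_steps: 2

# Option 2: tighter token budget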
max_batch_tokens: 4096
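
Per the config comment in Step 3a, a clip is dropped when its token count exceeds max_batch_tokens // batch_size. The sketch below applies that rule to a handful of hypothetical per-clip token counts to show how tightening the token budget changes which clips survive. How the real data loader counts tokens is not shown here, and both helper names are illustrative.

```python
def per_clip_token_limit(max_batch_tokens: int, batch_size: int) -> int:
    """Largest token count a single clip may have before it is filtered out."""
    return max_batch_tokens // batch_size


def keep_clips(token_counts, max_batch_tokens, batch_size):
    """Keep only clips at or below the per-clip limit."""
    limit = per_clip_token_limit(max_batch_tokens, batch_size)
    return [n for n in token_counts if n <= limit]


clips = [300, 900, 1500, 2500, 4500]  # hypothetical token counts
print(keep_clips(clips, max_batch_tokens=8192, batch_size=2))  # limit 4096
print(keep_clips(clips, max_batch_tokens=4096, batch_size=2))  # limit 2048
```

Note that under this rule the per-clip limit depends on batch_size as well, so changing the batch size also changes which clips are filtered.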

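Loss does not decrease

Check that the audio paths in the manifest are correct and the files are readable.
LibriSpeech FLAC files are 16 kHz, so keep sample_rate: 16000, which matches the AudioVAE encoder input rate. The data loader resamples automatically.
Check that transcripts are in sentence case, not all caps.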
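
The path and casing checks in this list are easy to script. A minimal sketch, assuming the manifest format from Step 2 (one JSON object per line with audio and text keys); check_manifest is a hypothetical helper, not part of the repository, and it only tests that files exist, not that they decode.

```python
import json
from pathlib import Path


def check_manifest(manifest_path):
    """Return a list of human-readable problems found in a JSONL manifest."""
    problems = []
    with open(manifest_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            entry = json.loads(line)
            audio, text = entry.get("audio"), entry.get("text", "")
            if not audio or not Path(audio).is_file():
                problems.append(f"line {lineno}: missing audio file {audio!r}")
            if text.isupper():
                problems.append(f"line {lineno}: transcript is all caps")
    return problems


if __name__ == "__main__":
    path = Path("examples/librispeech_train.jsonl")
    if path.exists():
        for problem in check_manifest(path):
            print(problem)
```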

Generated audio ignores the input text

A typical symptom of overfitting. Roll back to an earlier checkpoint:

ls checkpoints/librispeech_full/   # find a step before divergence

python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir checkpoints/librispeech_full/step_0001000 \
    --text "Test sentence." \
    --output test.wav

For future runs: always provide a val_manifest, set valid_interval: 50, and stop once val/loss turns upward. Keeping training within 1–3 epochs usually avoids the problem.