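Fine-Tuning Walkthrough: Fine-Tuning VoxCPM 2 on LibriSpeech
An end-to-end workflow: data preparation → fine-tuning → inference, using the public LibriSpeech corpus. The same workflow applies to similar speech datasets; only the data-preparation step needs to be swapped for your own data source.
Prerequisites
Hardware requirements
Rough estimates for VoxCPM 2 with batch_size=16 and max_batch_tokens=8192; actual usage depends on audio length and the number of gradient accumulation steps.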
| Configuration | SFT (full fine-tuning) | LoRA |
|---|---|---|
| Single GPU | ~40 GB VRAM | ~20 GB VRAM |
| DDP, extra overhead per GPU | +~10 GB | +~10 GB |
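Note
The extra DDP memory comes from the gradient buckets each GPU keeps for allreduce (roughly trainable parameters × 4 bytes) plus NCCL buffers. If you hit OOM under DDP, lower batch_size or max_batch_tokens.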
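The bucket estimate in the note can be turned into quick arithmetic. The sketch below is illustrative only: the helper name and the 2-billion-parameter figure are hypothetical placeholders rather than measured VoxCPM 2 values, and NCCL buffers are not included.

```python
def ddp_bucket_gib(trainable_params: int, bytes_per_grad: int = 4) -> float:
    """Approximate per-GPU gradient-bucket memory in GiB (4 bytes per gradient by default)."""
    return trainable_params * bytes_per_grad / 1024**3


# Hypothetical example: 2 billion trainable parameters (a placeholder, not a measured value).
example_gib = ddp_bucket_gib(2_000_000_000)
print(f"Gradient buckets: ~{example_gib:.1f} GiB per GPU")
```

Substitute the trainable-parameter count reported by your own run to get a figure for your setup.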
Software requirements
| Dependency | Requirement |
|---|---|
| Python | 3.10 or 3.11 (recommended for training) |
| PyTorch | 2.5.0+ (CUDA build matching your driver) |
| CUDA driver | 12.0+ |
| Disk space | |
pip install -e .
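Before starting a long run, it can help to confirm the environment matches the software requirements. The following is a minimal sketch; the function names are hypothetical, and it checks only the Python and PyTorch versions plus CUDA availability, not the driver version.

```python
import sys


def python_version_ok(version=None) -> bool:
    """True for Python 3.10 or 3.11, the versions listed in the software requirements."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in {(3, 10), (3, 11)}


def torch_status(min_version=(2, 5)):
    """Return (version_ok, cuda_available), or None if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    # torch.__version__ looks like "2.5.0+cu121"; only major and minor are compared.
    major, minor = (int(part) for part in torch.__version__.split(".")[:2])
    return (major, minor) >= min_version, torch.cuda.is_available()


print("Python OK:", python_version_ok())
print("PyTorch (version_ok, cuda_available):", torch_status())
```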
Step 1: Download LibriSpeech
# train-clean-100 (~6.3 GB compressed; FLAC is already compressed, so the extracted size is similar)
wget https://www.openslr.org/resources/12/train-clean-100.tar.gz
tar -xzf train-clean-100.tar.gz
Directory layout after extraction:
LibriSpeech/
└── train-clean-100/
    └── {speaker_id}/
        └── {chapter_id}/
            ├── {speaker_id}-{chapter_id}-{utt_id}.flac
            └── {speaker_id}-{chapter_id}.trans.txt   # "UTT_ID TRANSCRIPT" per line
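Step 2: Generate the JSONL manifest
The training script expects a JSONL manifest: one JSON object per line, containing at least an audio path and a text transcript.
Save the script below as scripts/prepare_librispeech_manifest.py and run it once: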
import json
from pathlib import Path

LIBRISPEECH_ROOT = Path("/path/to/LibriSpeech/train-clean-100")
OUTPUT_PATH = Path("examples/librispeech_train.jsonl")
MAX_SAMPLES = 1000  # set to None to use every utterance

entries = []
for trans_file in sorted(LIBRISPEECH_ROOT.rglob("*.trans.txt")):
    speaker_chapter_dir = trans_file.parent
    with open(trans_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Each line is "UTT_ID TRANSCRIPT"; split on the first space only.
            utt_id, text = line.split(" ", 1)
            audio_path = speaker_chapter_dir / f"{utt_id}.flac"
            if audio_path.exists():
                entries.append({"audio": str(audio_path), "text": text.capitalize()})
            if MAX_SAMPLES and len(entries) >= MAX_SAMPLES:
                break
    # Leave the outer loop as well once the cap is reached.
    if MAX_SAMPLES and len(entries) >= MAX_SAMPLES:
        break

OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)
with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

print(f"Wrote {len(entries):,} entries → {OUTPUT_PATH}")
python scripts/prepare_librispeech_manifest.py
The resulting manifest looks like this:
{"audio": "/path/to/LibriSpeech/train-clean-100/103/1240/103-1240-0000.flac", "text": "Chapter one missus rachel lynde is surprised ..."}
{"audio": "/path/to/LibriSpeech/train-clean-100/103/1240/103-1240-0001.flac", "text": "That had its source away back in the woods of the old cuthbert place ..."}
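Note
Why use text.capitalize() instead of keeping the all-caps transcripts? VoxCPM's pretraining corpus uses sentence case, and feeding all-caps transcripts at inference time may hurt text adherence. str.capitalize() is a simple heuristic; in production, a dedicated truecasing model would give better results.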
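The configs in Step 3 point val_manifest at examples/librispeech_val.jsonl, but the script in this step writes only the training manifest. One way to produce a validation file is to hold out a small random slice. The sketch below is an assumption about how you might do this, not part of the VoxCPM tooling; the function name, the 5 % default, and the seed are all hypothetical choices.

```python
import json
import random
from pathlib import Path


def split_manifest(src, train_out, val_out, val_fraction=0.05, seed=0):
    """Shuffle a JSONL manifest and write disjoint train/val files.

    Returns (n_train, n_val). `src` and `train_out` may be the same path,
    because the source is read fully before anything is written.
    """
    lines = [ln for ln in Path(src).read_text(encoding="utf-8").splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("need at least two entries to split")
    for ln in lines:
        json.loads(ln)  # fail early on malformed rows
    random.Random(seed).shuffle(lines)
    n_val = max(1, int(len(lines) * val_fraction))
    val, train = lines[:n_val], lines[n_val:]
    Path(val_out).write_text("\n".join(val) + "\n", encoding="utf-8")
    Path(train_out).write_text("\n".join(train) + "\n", encoding="utf-8")
    return len(train), len(val)


# Example call (paths follow the configs in this walkthrough):
# split_manifest("examples/librispeech_train.jsonl",
#                "examples/librispeech_train.jsonl",
#                "examples/librispeech_val.jsonl")
```

A random split leaves the same speakers in both files. If you want validation on unseen speakers, LibriSpeech's separate dev-clean subset can be converted with the manifest script instead.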
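Step 3a: Full fine-tuning (SFT)
Full fine-tuning updates all parameters. It suits large datasets, or cases where the domain shift is pronounced enough that LoRA lacks capacity.
Configuration file
Save as conf/librispeech_full.yaml: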
pretrained_path: /path/to/VoxCPM2/ # directory with config.json + model.safetensors
train_manifest: examples/librispeech_train.jsonl
val_manifest: examples/librispeech_val.jsonl # strongly recommended — enables early stopping
sample_rate: 16000 # AudioVAE encoder input rate (NOT the 48kHz output rate)
out_sample_rate: 48000 # AudioVAE decoder output rate; only used at inference, not during training
batch_size: 2
grad_accum_steps: 8 # effective bs = batch_size × grad_accum_steps = 16
num_workers: 8
num_iters: 62 # ~1 epoch for 1,000 clips at effective bs=16; adjust per your dataset
max_steps: 62
log_interval: 10
valid_interval: 62
save_interval: 62
learning_rate: 1.0e-5 # ~10× smaller than LoRA to avoid catastrophic forgetting
weight_decay: 0.01
warmup_steps: 6 # ≈ 10 % of num_iters
max_batch_tokens: 8192 # filters out clips whose token count > max_batch_tokens // batch_size
save_path: checkpoints/librispeech_full
tensorboard: checkpoints/librispeech_full/logs
lambdas:
  loss/diff: 1.0
  loss/stop: 1.0
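The step counts in this config follow from simple arithmetic. The sketch below reproduces it, assuming that num_iters counts optimizer steps (as the "~1 epoch" comment implies) and a single GPU; the helper names are hypothetical.

```python
def schedule(num_clips, batch_size, grad_accum_steps, warmup_ratio=0.1, epochs=1):
    """Derive effective batch size, iteration count, and warm-up steps."""
    effective_bs = batch_size * grad_accum_steps
    num_iters = (num_clips // effective_bs) * epochs
    warmup_steps = round(num_iters * warmup_ratio)
    return {"effective_bs": effective_bs, "num_iters": num_iters, "warmup_steps": warmup_steps}


def max_tokens_per_clip(max_batch_tokens, batch_size):
    """Per-clip token limit implied by the max_batch_tokens comment in the config."""
    return max_batch_tokens // batch_size


print(schedule(num_clips=1000, batch_size=2, grad_accum_steps=8))
print(max_tokens_per_clip(max_batch_tokens=8192, batch_size=2))
```

For a larger manifest, rerun the arithmetic with your clip count and update num_iters, max_steps, and warmup_steps together.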
Launch command
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/librispeech_full.yaml
# Multi-GPU (4×)
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
scripts/train_voxcpm_finetune.py --config_path conf/librispeech_full.yaml
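Alternatively, fill in CONFIG_PATH, TRAIN_MANIFEST, and BATCH_SIZE at the top of run_train.sh, then run bash run_train.sh.
Step 3b: LoRA fine-tuning
LoRA freezes the base model and trains only a small set of low-rank delta matrices. It is the recommended default starting point.
Configuration file
Save as conf/librispeech_lora.yaml: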
pretrained_path: /path/to/VoxCPM2/
train_manifest: examples/librispeech_train.jsonl
val_manifest: examples/librispeech_val.jsonl
sample_rate: 16000 # AudioVAE encoder input rate (NOT the 48kHz output rate)
out_sample_rate: 48000 # AudioVAE decoder output rate; only used at inference, not during training
batch_size: 2
grad_accum_steps: 8 # effective bs = 16
num_workers: 8
num_iters: 62 # ~1 epoch for 1,000 clips at effective bs=16
max_steps: 62
log_interval: 10
valid_interval: 62
save_interval: 62
learning_rate: 1.0e-4
weight_decay: 0.01
warmup_steps: 6 # ≈ 10 % of num_iters
max_batch_tokens: 8192
save_path: checkpoints/librispeech_lora
tensorboard: checkpoints/librispeech_lora/logs
lambdas:
  loss/diff: 1.0
  loss/stop: 1.0
lora:
  enable_lm: true
  enable_dit: true     # critical for voice quality; do not disable
  enable_proj: false
  r: 8                 # r=8 for speaker adaptation; r=32–64 for new languages
  alpha: 16
  dropout: 0.0
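To get a feel for what r and alpha control, the sketch below uses the standard LoRA formulation, in which a frozen weight W of shape d_out × d_in is augmented by (alpha / r) · B · A with B of shape d_out × r and A of shape r × d_in. Whether VoxCPM applies exactly this scaling is an assumption, and the 1024 × 1024 layer size is a hypothetical placeholder.

```python
def lora_scaling(r: int, alpha: float) -> float:
    """Scale applied to the low-rank update in the standard LoRA formulation."""
    return alpha / r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added to one linear layer: A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


d_in = d_out = 1024  # hypothetical layer size, for illustration only
full = d_in * d_out
for rank in (8, 32, 64):
    added = lora_param_count(d_in, d_out, rank)
    print(f"r={rank}: {added:,} trainable params ({added / full:.2%} of the full layer)")
print("scaling for r=8, alpha=16:", lora_scaling(8, 16))
```

Raising r increases adapter capacity linearly, which is why the config comment suggests larger ranks for bigger shifts such as new languages.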
Launch command
python scripts/train_voxcpm_finetune.py --config_path conf/librispeech_lora.yaml
Step 4: Monitor fine-tuning
tensorboard --logdir checkpoints/librispeech_full/logs
# or
tensorboard --logdir checkpoints/librispeech_lora/logs
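Which metrics to watch
| Metric | Healthy behavior |
|---|---|
| | Decreases steadily; flattens near convergence |
| | Drops quickly in the first 100–200 steps, then stays low |
| | Roughly within 0.3–2.0; occasional spikes are acceptable |
| | Cosine schedule: warm-up, then decay |
| | Tracks the training loss; stop if the training loss keeps falling while the validation loss rises |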
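When to stop training
TTS fine-tuning usually needs only 1–2 epochs. The best checkpoint is often not the last one.
Use valid_interval: 50 and save_interval: 50 so that rolling back is easy.
Pick the checkpoint with the lowest val/loss.
If you have no validation manifest, take several checkpoints from the convergence range, synthesize samples with the inference script, and pick the one that sounds best.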
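Warning
If val/loss starts rising while train/loss is still falling, stop immediately and roll back. This is the typical overfitting signal for TTS models: the model ignores the input text and produces similar-sounding output no matter what you feed it.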
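The stopping rule can be written down as a small check over logged losses. This is a minimal sketch under stated assumptions: the function names and the three-evaluation window are hypothetical, and the loss values are made-up numbers for illustration, not results from a real run.

```python
def best_checkpoint(val_history):
    """val_history: list of (step, val_loss) pairs. Returns the step with the lowest loss."""
    return min(val_history, key=lambda pair: pair[1])[0]


def is_overfitting(train_losses, val_losses, window=3):
    """True when, across the last `window` evaluations, val loss rose while train loss fell."""
    if len(train_losses) < window or len(val_losses) < window:
        return False
    train_tail, val_tail = train_losses[-window:], val_losses[-window:]
    return train_tail[-1] < train_tail[0] and val_tail[-1] > val_tail[0]


# Made-up values for illustration only.
steps = [50, 100, 150, 200]
train = [1.00, 0.80, 0.60, 0.50]
val = [1.20, 0.90, 0.95, 1.10]

print("best step:", best_checkpoint(list(zip(steps, val))))
print("overfitting:", is_overfitting(train, val))
```

Comparing only the endpoints of the window is a deliberately crude test; a smoothed trend would be less sensitive to a single noisy evaluation.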
Step 5: Verify with inference
Inference with an SFT checkpoint
# Standard TTS
python scripts/test_voxcpm_ft_infer.py \
--ckpt_dir checkpoints/librispeech_full/latest \
--text "She walked slowly along the quiet avenue, listening to the wind." \
--output output_full.wav
# Voice cloning (pass a reference clip and its exact transcript)
python scripts/test_voxcpm_ft_infer.py \
--ckpt_dir checkpoints/librispeech_full/latest \
--text "She walked slowly along the quiet avenue, listening to the wind." \
--prompt_audio examples/reference_speaker.wav \
--prompt_text "Exact transcript of the reference audio." \
--output output_full_cloned.wav
Inference with a LoRA checkpoint
python scripts/test_voxcpm_lora_infer.py \
--lora_ckpt checkpoints/librispeech_lora/latest \
--text "She walked slowly along the quiet avenue, listening to the wind." \
--output output_lora.wav
To evaluate and compare several checkpoints in one pass:
for ckpt in checkpoints/librispeech_lora/step_*/; do
    python scripts/test_voxcpm_lora_infer.py \
        --lora_ckpt "$ckpt" \
        --text "Evaluation sentence." \
        --output "eval_$(basename "$ckpt").wav"
done
Troubleshooting
Out of memory (OOM)
LibriSpeech clips vary in length (2–35 s). max_batch_tokens filters out the longest ones. If you still hit OOM, try:
# Option 1: smaller per-step batch, same effective size (2 × 8 → 1 × 16)
batch_size: 1
grad_accum_steps: 16
# Option 2: tighter token budget
max_batch_tokens: 4096
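Loss does not decrease
Confirm that the audio paths in the manifest are correct and the files are readable.
LibriSpeech FLAC files are 16 kHz; keep sample_rate: 16000, which matches the AudioVAE encoder input rate. The data loader resamples automatically.
Confirm that transcripts are in sentence case rather than all caps.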
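The path and casing checks above are easy to automate. The sketch below is a hypothetical helper, not part of the VoxCPM scripts; it assumes the audio/text manifest format from Step 2 and does not inspect sample rates, which would require an audio library.

```python
import json
from pathlib import Path


def check_manifest(manifest_path):
    """Return a list of (line_number, problem) tuples for a JSONL manifest."""
    problems = []
    with open(manifest_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            entry = json.loads(line)
            audio, text = entry.get("audio"), entry.get("text", "")
            if not audio or not Path(audio).is_file():
                problems.append((lineno, "audio file missing"))
            if not text.strip():
                problems.append((lineno, "empty transcript"))
            elif text.isupper():
                problems.append((lineno, "transcript is all caps"))
    return problems


# Example call (path follows the configs in this walkthrough):
# for lineno, problem in check_manifest("examples/librispeech_train.jsonl"):
#     print(f"line {lineno}: {problem}")
```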
Generated audio ignores the input text
This is a typical overfitting symptom. Roll back to an earlier checkpoint:
ls checkpoints/librispeech_full/ # find a step before divergence
python scripts/test_voxcpm_ft_infer.py \
--ckpt_dir checkpoints/librispeech_full/step_0001000 \
--text "Test sentence." \
--output test.wav
For future runs: always provide val_manifest, use valid_interval: 50, and stop once val/loss turns upward. Keeping training within 1–3 epochs usually avoids this problem.