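Fine-Tuning Walkthrough: Fine-Tuning VoxCPM 2 on LibriSpeech

The complete workflow (data preparation → fine-tuning → inference) on the public LibriSpeech corpus. The same workflow carries over to similar speech datasets: just swap the data preparation step for your own data source.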


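Prerequisites

Hardware requirements

Rough estimates for VoxCPM 2 with batch_size=16 and max_batch_tokens=8192; actual usage depends on audio length and the number of gradient accumulation steps.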

Configuration             SFT (full fine-tuning)    LoRA
Single GPU                ~40 GB VRAM               ~20 GB VRAM
DDP, extra per GPU        +~10 GB                   +~10 GB

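Note

The extra DDP memory comes from the gradient buckets that each GPU keeps for allreduce (roughly trainable parameters × 4 bytes) plus NCCL buffers. If you hit OOM under DDP, lower batch_size or max_batch_tokens.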
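The rule of thumb in the note can be turned into a quick estimate. This is a minimal sketch: the parameter counts are hypothetical placeholders chosen to show the arithmetic, not VoxCPM 2 figures, and NCCL buffers are not modelled.

```python
def ddp_bucket_gib(trainable_params: int, bytes_per_grad: int = 4) -> float:
    """Approximate per-GPU size of the DDP gradient buckets, in GiB."""
    return trainable_params * bytes_per_grad / 1024**3


# Hypothetical trainable-parameter counts, for illustration only.
for label, n_params in [("1B trainable", 1_000_000_000), ("50M trainable", 50_000_000)]:
    print(f"{label}: ~{ddp_bucket_gib(n_params):.2f} GiB of gradient buckets per GPU")
```

To use it on a real run, replace the placeholder counts with the number of parameters that actually receive gradients.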

Software requirements

Dependency     Requirement
Python         3.10 or 3.11 (recommended for training)
PyTorch        2.5.0+ (a CUDA build that matches your driver)
CUDA driver    12.0+
Disk space     ~30 GB for train-clean-100; ~5 GB for checkpoints

pip install -e .
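Before downloading anything, it can help to confirm that the environment meets the requirements listed above. The following is a minimal sketch; the helper names are hypothetical, and it only reports what it finds without enforcing anything.

```python
import shutil
import sys


def python_recommended(version_info=sys.version_info) -> bool:
    """True for the Python versions recommended for training (3.10 or 3.11)."""
    return tuple(version_info[:2]) in {(3, 10), (3, 11)}


def free_disk_gb(path: str = ".") -> float:
    """Free space on the filesystem that holds `path`, in GB."""
    return shutil.disk_usage(path).free / 1e9


print("Python", sys.version.split()[0], "| recommended:", python_recommended())
print(f"Free disk: {free_disk_gb():.0f} GB (corpus ~30 GB + checkpoints ~5 GB)")

try:
    import torch  # optional: only reported if it is already installed
    print("PyTorch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch is not installed in this environment")
```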

Step 1: Download LibriSpeech

# train-clean-100  (~6.3 GB compressed, ~30 GB extracted)
wget https://www.openslr.org/resources/12/train-clean-100.tar.gz
tar -xzf train-clean-100.tar.gz

Directory layout after extraction:

LibriSpeech/
└── train-clean-100/
    └── {speaker_id}/
        └── {chapter_id}/
            ├── {speaker_id}-{chapter_id}-{utt_id}.flac
            └── {speaker_id}-{chapter_id}.trans.txt   # "UTT_ID TRANSCRIPT" per line

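Step 2: Generate the JSONL manifest

The training script expects a JSONL manifest: one JSON object per line, containing at least an audio path and a text transcript.

Save the script below as scripts/prepare_librispeech_manifest.py and run it once: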

import json
from pathlib import Path

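LIBRISPEECH_ROOT = Path("/path/to/LibriSpeech/train-clean-100")  # extracted corpus root
OUTPUT_PATH      = Path("examples/librispeech_train.jsonl")      # manifest to write
MAX_SAMPLES      = 1000  # cap for a quick run; set to None to use every utterance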

entries = []
for trans_file in sorted(LIBRISPEECH_ROOT.rglob("*.trans.txt")):
    speaker_chapter_dir = trans_file.parent
    with open(trans_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            utt_id, text = line.split(" ", 1)
            audio_path = speaker_chapter_dir / f"{utt_id}.flac"
            if audio_path.exists():
                entries.append({"audio": str(audio_path), "text": text.capitalize()})
            if MAX_SAMPLES and len(entries) >= MAX_SAMPLES:
                break
    if MAX_SAMPLES and len(entries) >= MAX_SAMPLES:
        break

OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)
with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

print(f"Wrote {len(entries):,} entries → {OUTPUT_PATH}")

python scripts/prepare_librispeech_manifest.py

The generated manifest looks like this:

{"audio": "/path/to/LibriSpeech/train-clean-100/103/1240/103-1240-0000.flac", "text": "Chapter one missus rachel lynde is surprised ..."}
{"audio": "/path/to/LibriSpeech/train-clean-100/103/1240/103-1240-0001.flac", "text": "That had its source away back in the woods of the old cuthbert place ..."}
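The training configs in this walkthrough also reference examples/librispeech_val.jsonl, which the manifest script does not create. One way to get it is to hold out a small random slice of the manifest. The sketch below assumes a 5% hold-out and a fixed seed; both are arbitrary choices, and split_manifest is a hypothetical helper, not part of the repository.

```python
import random
from pathlib import Path


def split_manifest(src, train_out, val_out, val_fraction=0.05, seed=0):
    """Shuffle the JSONL lines in `src` and write train/validation manifests."""
    lines = [ln for ln in Path(src).read_text(encoding="utf-8").splitlines() if ln.strip()]
    random.Random(seed).shuffle(lines)  # deterministic shuffle for reproducibility
    n_val = max(1, int(len(lines) * val_fraction))  # keep at least one validation clip
    val, train = lines[:n_val], lines[n_val:]
    for path, rows in ((train_out, train), (val_out, val)):
        Path(path).parent.mkdir(parents=True, exist_ok=True)
        Path(path).write_text("\n".join(rows) + "\n", encoding="utf-8")
    return len(train), len(val)
```

Calling split_manifest("examples/librispeech_train.jsonl", "examples/librispeech_train.jsonl", "examples/librispeech_val.jsonl") rewrites the training manifest in place and writes the held-out clips next to it. A random utterance-level split keeps the same speakers in both sets; holding out whole speakers is the stricter alternative if you care about unseen voices.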

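Note

Why text.capitalize() instead of keeping the all-caps text? VoxCPM's pretraining corpus uses sentence-case text, so feeding all-caps transcripts can hurt how closely the output follows the text at inference time. str.capitalize() is a simple heuristic; for production, a dedicated truecasing model gives better results.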
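A short illustration of what the heuristic does and where it falls short. The first string is the opening of the sample manifest entry in its original all-caps form; the second is a made-up sentence.

```python
raw = "CHAPTER ONE MISSUS RACHEL LYNDE IS SURPRISED"
# str.capitalize() uppercases the first character and lowercases the rest.
print(raw.capitalize())

# Limitation: proper nouns and the pronoun "I" are lowercased as well.
print("THEN I WENT TO LONDON".capitalize())
```

The second call yields "Then i went to london", which is the kind of error a truecasing model would avoid.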


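Step 3a: Full fine-tuning (SFT)

Full fine-tuning updates all parameters. It suits large datasets, or cases where the domain shift is pronounced and LoRA does not have enough capacity.

Configuration file

Save as conf/librispeech_full.yaml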

pretrained_path: /path/to/VoxCPM2/             # directory with config.json + model.safetensors
train_manifest:  examples/librispeech_train.jsonl
val_manifest:    examples/librispeech_val.jsonl  # strongly recommended — enables early stopping

sample_rate:        16000   # AudioVAE encoder input rate (NOT the 48kHz output rate)
out_sample_rate:    48000   # AudioVAE decoder output rate; only used at inference, not during training
batch_size:         2
grad_accum_steps:   8       # effective bs = batch_size × grad_accum_steps = 16
num_workers:        8

num_iters:          62      # ~1 epoch for 1,000 clips at effective bs=16; adjust per your dataset
max_steps:          62
log_interval:       10
valid_interval:     62
save_interval:      62

learning_rate:  1.0e-5      # ~10× smaller than LoRA to avoid catastrophic forgetting
weight_decay:   0.01
warmup_steps:   6           # ≈ 10 % of num_iters
max_batch_tokens: 8192      # filters out clips whose token count > max_batch_tokens // batch_size

save_path:   checkpoints/librispeech_full
tensorboard: checkpoints/librispeech_full/logs

lambdas:
  loss/diff: 1.0
  loss/stop: 1.0
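The step counts in the config follow from the dataset size and the effective batch size. The sketch below reproduces that arithmetic so it can be redone for another dataset. The world_size factor assumes each DDP rank consumes its own batches, and the 10% warm-up ratio is taken from the config comment; schedule is a hypothetical helper.

```python
def schedule(num_clips, batch_size, grad_accum_steps, world_size=1, epochs=1, warmup_ratio=0.1):
    """Derive max_steps and warmup_steps from dataset size and batch settings."""
    effective_batch = batch_size * grad_accum_steps * world_size
    steps_per_epoch = num_clips // effective_batch  # the partial last batch is dropped
    max_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "max_steps": max_steps,
        "warmup_steps": round(max_steps * warmup_ratio),
    }


# The walkthrough's settings: 1,000 clips, batch_size 2, grad_accum_steps 8, single GPU.
print(schedule(num_clips=1000, batch_size=2, grad_accum_steps=8))
```

For the tutorial settings this gives 62 steps and 6 warm-up steps, matching the config. Clips removed by the max_batch_tokens filter reduce num_clips, so the real epoch can be slightly shorter.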

Launch commands

# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/librispeech_full.yaml

# Multi-GPU (4×)
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/librispeech_full.yaml

Alternatively, set CONFIG_PATH, TRAIN_MANIFEST, and BATCH_SIZE at the top of run_train.sh, then run bash run_train.sh.


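Step 3b: LoRA fine-tuning

LoRA freezes the base model and trains only a small set of low-rank update matrices. It is the recommended default starting point.

Configuration file

Save as conf/librispeech_lora.yaml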

pretrained_path: /path/to/VoxCPM2/
train_manifest:  examples/librispeech_train.jsonl
val_manifest:    examples/librispeech_val.jsonl

sample_rate:        16000   # AudioVAE encoder input rate (NOT the 48kHz output rate)
out_sample_rate:    48000   # AudioVAE decoder output rate; only used at inference, not during training
batch_size:         2
grad_accum_steps:   8       # effective bs = 16
num_workers:        8

num_iters:          62      # ~1 epoch for 1,000 clips at effective bs=16
max_steps:          62
log_interval:       10
valid_interval:     62
save_interval:      62

learning_rate:  1.0e-4
weight_decay:   0.01
warmup_steps:   6           # ≈ 10 % of num_iters
max_batch_tokens: 8192

save_path:   checkpoints/librispeech_lora
tensorboard: checkpoints/librispeech_lora/logs

lambdas:
  loss/diff: 1.0
  loss/stop: 1.0

lora:
  enable_lm:   true
  enable_dit:  true    # critical for voice quality — do not disable
  enable_proj: false
  r:     8             # r=8 for speaker adaptation; r=32–64 for new languages
  alpha: 16
  dropout: 0.0
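To get a feel for what r and alpha control, the sketch below counts the parameters LoRA adds to a single linear layer and computes the usual alpha / r scaling factor. The 1024 × 1024 layer size is a hypothetical example, not a VoxCPM 2 dimension, and the alpha / r scaling follows the standard LoRA formulation, which this implementation is assumed to share.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scale(alpha: float, r: int) -> float:
    """Scaling applied to the low-rank update in the standard formulation."""
    return alpha / r


d_in = d_out = 1024  # hypothetical layer size, for illustration only
full = d_in * d_out
for r in (8, 32, 64):
    extra = lora_extra_params(d_in, d_out, r)
    print(f"r={r:>2}: {extra:,} extra params ({extra / full:.1%} of the frozen weight)")
print("scale for alpha=16, r=8:", lora_scale(16, 8))
```

With r=8 the adapter adds 16,384 parameters to a weight of 1,048,576, about 1.6%, which is why LoRA needs so much less memory than full fine-tuning.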

Launch command

python scripts/train_voxcpm_finetune.py --config_path conf/librispeech_lora.yaml

Step 4: Monitor fine-tuning

tensorboard --logdir checkpoints/librispeech_full/logs
# or
tensorboard --logdir checkpoints/librispeech_lora/logs

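Key metrics to watch

Metric       Healthy behavior
loss/diff    Decreases steadily; flattens out near convergence
loss/stop    Drops quickly over the first 100–200 steps, then stays low
grad_norm    Roughly within 0.3–2.0; occasional spikes are acceptable
lr           Warm-up followed by cosine decay
val/loss     Tracks the training loss; stop if the training loss keeps falling while the validation loss rises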
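For reference when reading the lr curve, here is a minimal sketch of a warm-up plus cosine-decay schedule. The linear warm-up shape and the decay to zero are assumptions; the training script's scheduler may differ in both, so treat this as an aid for recognising the curve rather than as its implementation.

```python
import math


def lr_at(step, base_lr, warmup_steps, max_steps, min_lr=0.0):
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < warmup_steps:
        # Assumed linear ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to min_lr over the remaining steps.
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# LoRA settings from this walkthrough: lr 1.0e-4, 6 warm-up steps, 62 steps in total.
for step in (0, 5, 6, 34, 62):
    print(step, f"{lr_at(step, 1.0e-4, warmup_steps=6, max_steps=62):.2e}")
```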

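When to stop training

TTS fine-tuning usually needs only 1–2 epochs. The best checkpoint is often not the last one.

  • Set valid_interval: 50 and save_interval: 50 so that you can roll back easily.

  • Pick the checkpoint with the lowest val/loss.

  • If you have no validation manifest, take a few checkpoints from the range where the loss has converged, listen to their output with the inference script, and keep the one that sounds best.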

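Warning

If val/loss starts rising while train/loss is still falling, stop immediately and roll back. This is the typical overfitting signal for TTS models: the model ignores the input text and produces similar timbre patterns whatever the input is.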
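The selection and stopping rules above can be expressed in a few lines. This is a minimal sketch: the loss values are made up for illustration, the patience of two evaluations is an arbitrary choice, and reading the values from the TensorBoard logs is left out.

```python
def best_checkpoint(history):
    """Return the step whose validation loss is lowest. history: [(step, val_loss), ...]"""
    return min(history, key=lambda item: item[1])[0]


def should_stop(history, patience=2):
    """True once the last `patience` evaluations all failed to beat the earlier best."""
    if len(history) <= patience:
        return False
    best_before = min(loss for _, loss in history[:-patience])
    return all(loss >= best_before for _, loss in history[-patience:])


# Made-up validation losses at valid_interval: 50.
history = [(50, 0.92), (100, 0.81), (150, 0.78), (200, 0.80), (250, 0.84)]
print("best step:", best_checkpoint(history))
print("stop now:", should_stop(history))
```

With these numbers the loss bottoms out at step 150 and rises for two evaluations in a row, so the sketch reports step 150 as the checkpoint to keep and signals a stop.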


Step 5: Verify with inference

Inference with an SFT checkpoint

# Standard TTS
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir checkpoints/librispeech_full/latest \
    --text "She walked slowly along the quiet avenue, listening to the wind." \
    --output output_full.wav

# Voice cloning (pass a reference clip and its exact transcript)
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir checkpoints/librispeech_full/latest \
    --text "She walked slowly along the quiet avenue, listening to the wind." \
    --prompt_audio examples/reference_speaker.wav \
    --prompt_text  "Exact transcript of the reference audio." \
    --output output_full_cloned.wav

Inference with a LoRA checkpoint

python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt checkpoints/librispeech_lora/latest \
    --text "She walked slowly along the quiet avenue, listening to the wind." \
    --output output_lora.wav

To evaluate and compare several checkpoints in one pass:

for ckpt in checkpoints/librispeech_lora/step_*/; do
    python scripts/test_voxcpm_lora_infer.py \
        --lora_ckpt "$ckpt" \
        --text "Evaluation sentence." \
        --output "eval_$(basename "$ckpt").wav"
done

Troubleshooting

Out of memory (OOM)

LibriSpeech clips vary in length (2–35 s). max_batch_tokens filters out the longest ones. If you still run out of memory, try:

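# Option 1: smaller batch with the same effective size (batch_size × grad_accum_steps = 16).
# These values only help if your current batch_size is above 8 (the hardware estimates use 16);
# starting from the batch_size: 2 configs above, use 1 and 16 instead.
batch_size:       8
grad_accum_steps: 2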

# Option 2 — tighter token budget
max_batch_tokens: 4096
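According to the comment in the full fine-tuning config, clips whose token count exceeds max_batch_tokens // batch_size are filtered out, so both settings change which clips survive. The sketch below only reproduces that integer division; how the training script converts audio length to tokens is not covered here.

```python
def per_clip_token_cap(max_batch_tokens: int, batch_size: int) -> int:
    """Largest token count a single clip may have before it is filtered out."""
    return max_batch_tokens // batch_size


for max_batch_tokens, batch_size in [(8192, 2), (4096, 2), (8192, 8)]:
    cap = per_clip_token_cap(max_batch_tokens, batch_size)
    print(f"max_batch_tokens={max_batch_tokens}, batch_size={batch_size} -> clips above {cap} tokens are dropped")
```

Halving max_batch_tokens halves the cap, so check how many clips remain afterwards.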

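Loss does not decrease

  • Check that the audio paths in the manifest are correct and the files are readable.

  • LibriSpeech FLAC files are 16 kHz, so keep sample_rate: 16000, which matches the AudioVAE encoder input rate. The data loader resamples automatically.

  • Check that the transcripts are in sentence case rather than all caps.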
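The path and casing checks in this list can be automated with a short script. This is a minimal sketch using only the standard library; check_manifest is a hypothetical helper, and it tests that each file exists without trying to decode the audio.

```python
import json
from pathlib import Path


def check_manifest(manifest_path):
    """Return a list of (line_number, problem) tuples for a JSONL manifest."""
    problems = []
    with open(manifest_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                problems.append((lineno, "invalid JSON"))
                continue
            if not isinstance(entry, dict):
                problems.append((lineno, "not a JSON object"))
                continue
            audio, text = entry.get("audio"), str(entry.get("text", ""))
            if not audio or not Path(audio).is_file():
                problems.append((lineno, f"missing audio file: {audio}"))
            if not text.strip():
                problems.append((lineno, "empty transcript"))
            elif text.isupper():
                problems.append((lineno, "all-caps transcript"))
    return problems
```

Running check_manifest("examples/librispeech_train.jsonl") and printing the result gives a quick list of entries to fix before the next training run.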

Generated audio ignores the input text

This is a typical symptom of overfitting. Roll back to an earlier checkpoint:

ls checkpoints/librispeech_full/   # find a step before divergence

python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir checkpoints/librispeech_full/step_0001000 \
    --text "Test sentence." \
    --output test.wav

For later runs: always provide a val_manifest, set valid_interval: 50, and stop as soon as val/loss turns upward. Keeping training within 1–3 epochs usually avoids the problem.