Fine-Tuning FAQ
===============

General
*******

**Can I fine-tune for a new language?**

Yes. For languages not yet supported by VoxCPM, we recommend full fine-tuning with 500+ hours of target-language data, mixed with some Chinese/English data to reduce forgetting. Use conservative learning rates (``1e-5``). For the full workflow, see :doc:`./finetune`.

Training Issues
***************

**Out of memory (OOM):**

- Reduce ``batch_size``, or lower ``max_batch_tokens`` to filter out long samples.
- Increase ``grad_accum_steps`` to maintain effective batch size with less per-step memory.
- Switch to LoRA fine-tuning; it uses roughly half the VRAM of full fine-tuning.
- For multi-GPU (DDP), expect ~10 GB additional VRAM per card from gradient buckets and NCCL buffers.

**Training loss not converging:**

- Decrease ``learning_rate``.
- Increase ``warmup_steps``.
- Check data quality; noisy audio or mismatched transcripts are common culprits.

**Resume training shows wrong step count:** (#187)

This is a known bug in multi-GPU training. Ensure you are using the latest version of the training scripts.

Output Quality Issues
*********************

**Model ignores input text after fine-tuning (overfitting):** (#169)

The model has overfit to reproducing training audio without text conditioning. This is the most common fine-tuning failure mode and can emerge surprisingly early (within a few hundred steps for small datasets).

- Keep ``training_cfg_rate=0.1`` (do NOT set it to 0).
- Keep ``weight_decay=0.01``.
- Reduce learning rate to ``1e-5`` (full FT) or ``1e-4`` (LoRA).
- Monitor checkpoints at each ``save_interval``. For most single-speaker tasks, 1–3 epochs is sufficient; training beyond that often hurts.

**Generation doesn't stop (runaway output):** (#195, #124)

- Check your training data for clips with long trailing silence (>0.5 s) and trim them; this is the most common cause.
- Enable ``retry_badcase=True`` at inference time as a safety net.
- If fine-tuning a new language, the stop loss and diffusion loss may converge at different rates; try increasing the stop loss weight (``lambdas > loss/stop``).

LoRA Issues
***********

**Poor LoRA quality:**

- Increase ``r`` (LoRA rank); try 32 or 64 for harder tasks like style or language adaptation.
- Adjust ``alpha``; try ``alpha = r`` or ``alpha = 2*r``.
- Ensure ``enable_dit: true``; this is essential for voice quality.
- Increase training steps if the model has not converged yet.

**LoRA not taking effect at inference:**

- Ensure the inference LoRA config (``r``, ``alpha``, ``enable_lm``, ``enable_dit``) matches the training config exactly.
- Check the return value of ``load_lora``: ``skipped_keys`` should be empty.
- Verify ``set_lora_enabled(True)`` is called if you previously disabled it.

Checkpoint Issues
*****************

**Checkpoint loading errors:**

- **Full fine-tuning:** the checkpoint directory must contain ``model.safetensors`` (or ``pytorch_model.bin``), ``config.json``, and ``audiovae.pth``.
- **LoRA:** the checkpoint directory must contain ``lora_weights.safetensors`` (or ``lora_weights.ckpt``) and ``lora_config.json``.
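
Before digging into a loading error, it can help to confirm that the directory actually holds the files listed above. The following is a minimal sketch that uses only the Python standard library and the file names from this section. The helper ``missing_checkpoint_files`` and the ``"full"``/``"lora"`` mode labels are hypothetical names chosen for illustration and are not part of the VoxCPM API.

.. code-block:: python

    from pathlib import Path

    # File sets taken from the checklist above. Each inner tuple lists
    # interchangeable alternatives; at least one of them must exist.
    REQUIRED_FILES = {
        "full": [
            ("model.safetensors", "pytorch_model.bin"),
            ("config.json",),
            ("audiovae.pth",),
        ],
        "lora": [
            ("lora_weights.safetensors", "lora_weights.ckpt"),
            ("lora_config.json",),
        ],
    }


    def missing_checkpoint_files(ckpt_dir, mode):
        """Return descriptions of required files absent from ckpt_dir.

        mode is "full" or "lora" (hypothetical labels for this sketch).
        An empty list means every required file was found.
        """
        root = Path(ckpt_dir)
        missing = []
        for alternatives in REQUIRED_FILES[mode]:
            # A requirement is satisfied if any one alternative is present.
            if not any((root / name).is_file() for name in alternatives):
                missing.append(" or ".join(alternatives))
        return missing

Calling ``missing_checkpoint_files("path/to/checkpoint", "lora")`` returns an empty list when the directory is complete. Otherwise it returns one entry per missing requirement, which usually shows whether the problem is an incomplete save or a wrong path.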