========================
ComfyUI-VoxCPMTTS
========================

ComfyUI-VoxCPMTTS is a lightweight ComfyUI node for VoxCPM 1.5, featuring built-in **automatic speech recognition** for reference audio transcription.

- Repo: `1038lab/ComfyUI-VoxCPMTTS <https://github.com/1038lab/ComfyUI-VoxCPMTTS>`_

.. note::

   For LoRA training and dual-model support, see :doc:`comfyui_voxcpm`.

.. list-table:: Supported VoxCPM Versions
   :widths: 30 70
   :header-rows: 0

   * - VoxCPM 1.0 (0.5B)
     - ✅ Available in UI (16 kHz)
   * - VoxCPM 1.5
     - ✅ Default and recommended (44.1 kHz)
   * - VoxCPM 2
     - ❌ Not supported

This extension provides two node variants:

* **AILab_VoxCPMTTS**: simplified node with hidden advanced parameters
* **AILab_VoxCPMTTS_Advanced**: full control over all generation parameters

Features
--------

* TTS and voice cloning with **VoxCPM 1.5** (44.1 kHz output); VoxCPM 0.5B also available in the UI
* **Auto-transcription** of reference audio via faster-whisper (requires enabling ``auto_transcribe_reference``)
* Fade-in post-processing for smoother audio output (configurable ``fade_in_ms`` on Advanced node)
* Simplified node uses fixed defaults (``cfg_value=2.0``, ``inference_steps=10``); Advanced node exposes full manual control
* ``REFERENCE_TEXT`` output port for inspecting ASR results
* Configurable ASR model via environment variable ``VOXCPM_ASR_MODEL`` (``tiny`` / ``small`` / ``medium`` / ``large``)
* Multi-device: **auto** (auto-detect), **CUDA**, **MPS**, **CPU**

Prerequisites
-------------

* ComfyUI installed and running
* PyTorch with appropriate backend
* ~1.2 GB disk space for model downloads

Installation
------------

* Via ComfyUI Manager: search for ``VoxCPMTTS`` and install.

Manual installation:

.. code-block:: bash

   cd ComfyUI/custom_nodes/
   git clone https://github.com/1038lab/ComfyUI-VoxCPMTTS.git
   pip install -r ComfyUI-VoxCPMTTS/requirements.txt
   # Restart ComfyUI

Models are auto-downloaded on first use. Default path is ``ComfyUI/models/TTS/VoxCPM1.5/`` for the 1.5 model, or ``ComfyUI/models/TTS/VoxCPM-0.5B/`` if 0.5B is selected.

Basic usage
-----------

Text-to-Speech
^^^^^^^^^^^^^^

1. Add the **VoxCPM TTS** node (simplified) or **VoxCPM TTS (Advanced)** node
2. Enter the ``text`` to synthesize
3. The simplified node uses sensible defaults (``cfg_value=2.0``, ``inference_steps=10``)
4. The Advanced node lets you manually set ``cfg_value``, ``inference_steps``, and other parameters

Voice cloning
^^^^^^^^^^^^^

1. Connect ``reference_audio`` with a reference audio clip
2. For auto-transcription: enable ``auto_transcribe_reference`` on the node (required; leaving ``reference_text`` empty alone is not sufficient)
3. Alternatively, provide the transcript manually in ``reference_text``
4. The ``REFERENCE_TEXT`` output shows the detected/provided transcript for verification

Recommended parameter tuning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The README suggests the following parameter combinations for different speed/quality trade-offs (set manually on the Advanced node):

.. list-table::
   :widths: 20 25 25 30
   :header-rows: 1

   * - Profile
     - CFG Value
     - Inference Steps
     - Speed
   * - Fast
     - 1.5
     - 5
     - Fastest
   * - Balanced
     - 2.0
     - 10
     - Moderate
   * - High Quality
     - 3.0
     - 20
     - Slowest

Advanced node parameters
^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :widths: 25 75
   :header-rows: 1

   * - Parameter
     - Description
   * - ``cfg_value``
     - Classifier-free guidance scale
   * - ``inference_steps``
     - LocDiT diffusion steps
   * - ``max_length``
     - Maximum generation token length
   * - ``fade_in_ms``
     - Fade-in duration for audio smoothing
   * - ``retry_attempts``
     - Maximum retries for bad outputs
   * - ``retry_threshold``
     - Threshold for bad-case detection
   * - ``auto_transcribe_reference``
     - Enable ASR for reference audio
   * - ``normalize``
     - Enable text normalization
   * - ``unload_model``
     - Unload model after inference to free VRAM

Troubleshooting
---------------

Out of Memory (OOM)
^^^^^^^^^^^^^^^^^^^^

VoxCPM 1.5 requires significant VRAM. If you encounter OOM errors:

* Enable ``unload_model`` to release GPU memory after each generation
* Switch ``device`` to ``cpu`` (slower but uses system RAM)
* Close other GPU-intensive applications
* Try the **Fast** quality preset to reduce memory usage

Model download issues
^^^^^^^^^^^^^^^^^^^^^

If auto-download fails, manually download from Hugging Face and place files in ``ComfyUI/models/TTS/VoxCPM1.5/``.

For debug logging, set ``COMFYUI_LOG_LEVEL=DEBUG``.

Comparing with ComfyUI-VoxCPM
------------------------------

.. list-table::
   :widths: 30 35 35
   :header-rows: 1

   * - Feature
     - ComfyUI-VoxCPM
     - ComfyUI-VoxCPMTTS
   * - Model support
     - VoxCPM 1.5 + 0.5B
     - VoxCPM 1.5 (recommended) + 0.5B
   * - LoRA training
     - ✅ Built-in
     - ❌
   * - Auto-transcription
     - ❌ Manual only
     - ✅ faster-whisper
   * - Node variants
     - Single node
     - Simple + Advanced
   * - Quality presets
     - Manual parameters
     - Fast / Balanced / HQ
   * - Dependencies
     - Heavier (LoRA training)
     - Lighter
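
The Fast / Balanced / HQ profiles listed in the comparison are the README's recommended ``cfg_value`` and ``inference_steps`` pairs, which are entered manually on the Advanced node. If you prepare node inputs from a script, one way to keep those pairs in one place is a small lookup table. The following is a minimal sketch, not part of the extension: ``Profile``, ``PROFILES``, and ``advanced_inputs`` are hypothetical names introduced here for illustration, and only the numeric values and the parameter names ``text``, ``cfg_value``, and ``inference_steps`` come from this page.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass(frozen=True)
   class Profile:
       """One row of the recommended parameter tuning table."""
       cfg_value: float
       inference_steps: int


   # Values copied from the tuning table; the dictionary keys are hypothetical labels.
   PROFILES = {
       "fast": Profile(cfg_value=1.5, inference_steps=5),
       "balanced": Profile(cfg_value=2.0, inference_steps=10),
       "high_quality": Profile(cfg_value=3.0, inference_steps=20),
   }


   def advanced_inputs(profile_name: str, text: str) -> dict:
       """Return the values you would enter on the Advanced node for a profile.

       Hypothetical helper: it only builds a plain dict and does not call
       ComfyUI or the extension.
       """
       key = profile_name.lower()
       if key not in PROFILES:
           raise ValueError(f"unknown profile {profile_name!r}; expected one of {sorted(PROFILES)}")
       profile = PROFILES[key]
       return {
           "text": text,
           "cfg_value": profile.cfg_value,
           "inference_steps": profile.inference_steps,
       }


   if __name__ == "__main__":
       print(advanced_inputs("balanced", "Hello from VoxCPM."))

The ``balanced`` entry matches the fixed defaults of the simplified node, so it is a reasonable starting point before trying the faster or slower profiles.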