========================
VoxCPM.cpp
========================

VoxCPM.cpp is a standalone C++ inference engine for VoxCPM based on ggml, with GGUF model support.

- Repo: `bluryar/VoxCPM.cpp <https://github.com/bluryar/VoxCPM.cpp>`_
- GGUF Weights: `bluryar/VoxCPM-GGUF <https://huggingface.co/bluryar/VoxCPM-GGUF>`_
- Upstream: `OpenBMB/VoxCPM <https://github.com/OpenBMB/VoxCPM>`_

It provides a CLI tool (``voxcpm_tts``) for offline synthesis and an OpenAI-compatible HTTP server (``voxcpm-server``) with streaming support.

.. list-table:: Supported VoxCPM Versions
   :widths: 30 70
   :header-rows: 0

   * - VoxCPM 1.0 (0.5B)
     - ✅ Supported
   * - VoxCPM 1.5
     - ✅ Supported
   * - VoxCPM 2
     - ❌ Not supported

Multiple quantization formats available: Q4_K, Q8_0, F16, F32.

Features
--------

* CLI backends: **CPU**, **CUDA**, **Vulkan** (via ``--backend {cpu|cuda|vulkan|auto}``)
* GGUF quantized models (Q4_K, Q8_0, F16, F32) for flexible speed/quality trade-offs
* OpenAI-compatible HTTP server with streaming (audio / SSE modes)
* Voice registration and management via API
* Optional Bearer token authentication (``--api-key``)
* Benchmark scripts for performance profiling
* Experimental WASM/Emscripten web playground (see ``docs/wasm_playground.md`` in repo)

Performance
-----------

Benchmarked on NVIDIA RTX 4060 Ti (CUDA) and Intel i5-12600K (CPU, 8 threads), inference timesteps = 10:

.. list-table:: CUDA Inference (RTF, lower is better)
   :widths: 20 15 20 20 20
   :header-rows: 1

   * - Model
     - Quant
     - Model Only
     - Without Encode
     - Full Pipeline
   * - VoxCPM 1.5
     - Q8_0
     - 0.320
     - 0.411
     - 0.596
   * - VoxCPM 1.5
     - F16
     - 0.352
     - 0.442
     - 0.648
   * - VoxCPM-0.5B
     - F16
     - 0.390
     - 0.428
     - 0.567

.. list-table:: CPU Inference (RTF, lower is better)
   :widths: 20 15 20 20 20
   :header-rows: 1

   * - Model
     - Quant
     - Model Only
     - Without Encode
     - Full Pipeline
   * - VoxCPM 1.5
     - Q8_0
     - 2.086
     - 2.982
     - 4.291
   * - VoxCPM-0.5B
     - Q4_K
     - 1.826
     - 2.219
     - 3.609

Prerequisites
-------------

* CMake >= 3.14
* C++ compiler with C++17 support
* (Optional) CUDA toolkit for GPU acceleration
* (Optional) Vulkan SDK for Vulkan backend

Installation
------------

Build from source:

.. code-block:: bash

   git clone https://github.com/bluryar/VoxCPM.cpp.git
   cd VoxCPM.cpp

   # CPU build
   cmake -B build
   cmake --build build

   # CUDA build
   cmake -B build-cuda -DVOXCPM_CUDA=ON \
     -DCMAKE_CUDA_ARCHITECTURES=89 \
     -DVOXCPM_BUILD_BENCHMARK=OFF \
     -DVOXCPM_BUILD_TESTS=OFF
   cmake --build build-cuda

Download GGUF weights from Hugging Face:

.. code-block:: bash

   # Example: download VoxCPM 1.5 Q8_0 with AudioVAE F16
   huggingface-cli download bluryar/VoxCPM-GGUF \
     --include "voxcpm1.5-q8_0-audiovae-f16.gguf" \
     --local-dir models/

Basic usage (CLI)
-----------------

The ``voxcpm_tts`` binary supports text-to-speech and voice cloning:

.. code-block:: bash

   # Basic TTS
   ./build/examples/voxcpm_tts \
     --model-path models/voxcpm1.5-q8_0-audiovae-f16.gguf \
     --backend auto \
     --threads 8 \
     --text "Hello, this is VoxCPM running in C++." \
     --output output.wav

   # Voice cloning
   ./build/examples/voxcpm_tts \
     --model-path models/voxcpm1.5-q8_0-audiovae-f16.gguf \
     --backend auto \
     --prompt-audio ref_voice.wav \
     --prompt-text "Reference transcript" \
     --text "Cloned voice speaking new text." \
     --output output.wav \
     --inference-timesteps 10 \
     --cfg-value 2.0

HTTP Server
-----------

The ``voxcpm-server`` provides an OpenAI-compatible ``/v1/audio/speech`` endpoint:

.. code-block:: bash

   ./build/examples/voxcpm-server \
     --model-path models/voxcpm1.5-q8_0-audiovae-f16.gguf \
     --model-name voxcpm-1.5 \
     --voice-dir voices/ \
     --disable-auth \
     --host 0.0.0.0 --port 8080

Register a voice and synthesize:

.. code-block:: bash

   # Register a voice
   curl -X POST http://localhost:8080/v1/voices \
     -F "id=my_voice" \
     -F "audio=@ref_voice.wav" \
     -F "text=Reference transcript"

   # Synthesize with registered voice
   curl http://localhost:8080/v1/audio/speech \
     -H "Content-Type: application/json" \
     -d '{"model":"voxcpm-1.5","voice":"my_voice","input":"Hello world"}' \
     --output speech.wav

.. note::

   The examples above use ``--disable-auth``. When authentication is enabled,
   add ``-H "Authorization: Bearer <api-key>"`` to all requests.

Key server options:

* ``--api-key``: set Bearer token for authentication
* ``--disable-auth``: disable Bearer token authentication
* ``--max-queue``: maximum queued requests (returns 503 when full)
* ``--model-name``: model identifier used in API requests

.. warning::

   The HTTP server is intended for development and testing.
   Use ``--api-key`` before exposing to untrusted networks.

Troubleshooting
---------------

Binary not found after build
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Built binaries are located under ``build/examples/``, not ``build/bin/``.
For CUDA builds, look in ``build-cuda/examples/``.

Model loading errors
^^^^^^^^^^^^^^^^^^^^

Ensure you are passing a single ``.gguf`` file path (not a directory) to ``--model-path``.
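
A quick pre-flight check can rule out a wrong path or a non-GGUF file before launching the binary. The sketch below is a hypothetical helper (``check_gguf`` is not part of VoxCPM.cpp). It assumes only that GGUF files begin with the four ASCII bytes ``GGUF``, and it does not validate anything beyond that header.

.. code-block:: bash

   # Hypothetical helper, not shipped with VoxCPM.cpp:
   # sanity-check a path before passing it to --model-path.
   check_gguf() {
       model_path="$1"

       # A directory is the mistake called out above.
       if [ -d "$model_path" ]; then
           echo "error: $model_path is a directory; pass a single .gguf file" >&2
           return 1
       fi

       if [ ! -f "$model_path" ]; then
           echo "error: $model_path does not exist" >&2
           return 1
       fi

       # GGUF files start with the ASCII magic "GGUF".
       # tr strips NUL bytes so the shell does not warn on other binary files.
       magic="$(head -c 4 "$model_path" | tr -d '\0')"
       if [ "$magic" != "GGUF" ]; then
           echo "error: $model_path does not start with the GGUF magic bytes" >&2
           return 1
       fi

       echo "ok: $model_path looks like a GGUF file"
   }

For example, ``check_gguf models/voxcpm1.5-q8_0-audiovae-f16.gguf`` returns 0 when the file exists and carries the magic bytes. Any failure returns 1 with a message on stderr.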