openai-compatible whisper cli with ctranslate2 acceleration
Provides a drop-in replacement CLI for OpenAI's Whisper that maintains argument and output compatibility while swapping the inference backend for CTranslate2, a fast inference engine for Transformer models. This allows users to swap the binary without changing scripts or workflows, while CTranslate2 handles model quantization, layer fusion, and CPU/GPU optimization under the hood to achieve 4-10x faster inference than the original Whisper implementation.
Unique: Maintains 100% CLI argument compatibility with OpenAI's official Whisper while swapping the inference backend to CTranslate2, enabling existing shell scripts and CI/CD pipelines to gain 4-10x speedup with zero code changes. The architecture uses a thin wrapper that parses OpenAI's argument format, loads pre-quantized CTranslate2 models, and reformats output to match the original JSON schema exactly.
vs alternatives: Faster than native Whisper (4-10x speedup via quantization and layer fusion), and unlike Faster-Whisper (which is also built on CTranslate2 but exposes its own Python API and CLI conventions), it maintains exact CLI compatibility with the original, so no argument remapping is required.
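A minimal sketch of that wrapper layer, assuming a subset of openai/whisper's actual CLI argument names (the real tool exposes many more flags, and the CTranslate2-backed transcription step is omitted here):

    import argparse

    def build_parser():
        # Mirror a subset of openai/whisper's CLI surface so existing
        # scripts keep working; names and choices follow the original CLI.
        parser = argparse.ArgumentParser(prog="whisper")
        parser.add_argument("audio", nargs="+", help="audio file(s) to transcribe")
        parser.add_argument("--model", default="small")
        parser.add_argument("--output_dir", "-o", default=".")
        parser.add_argument("--output_format", "-f", default="all",
                            choices=["txt", "vtt", "srt", "tsv", "json", "all"])
        parser.add_argument("--task", default="transcribe",
                            choices=["transcribe", "translate"])
        parser.add_argument("--language", default=None)
        parser.add_argument("--device", default="auto",
                            choices=["cpu", "cuda", "auto"])
        return parser

    if __name__ == "__main__":
        args = build_parser().parse_args()
        # A real wrapper would now load the CTranslate2 model and hand the
        # parsed arguments to the transcription and output-formatting stages.
        print(args)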
ctranslate2 model quantization and optimization pipeline
Converts standard Whisper PyTorch models (.pt checkpoints) into CTranslate2's optimized binary format, applying techniques like INT8 quantization, layer fusion, and operator-specific optimizations. The conversion process is a one-time offline step that produces a compact, inference-optimized model directory structure that CTranslate2's C++ runtime can load and execute with minimal memory overhead.
Unique: Implements CTranslate2's quantization pipeline tuned for Whisper's encoder-decoder architecture, preserving attention and layer-normalization precision while aggressively quantizing linear layers. Unlike generic quantization tools, this approach accounts for the structure of Whisper's encoder-decoder stack and applies INT8 quantization selectively to maintain speech recognition accuracy.
vs alternatives: Produces smaller, faster models than ONNX quantization (which adds runtime overhead) and maintains better accuracy than naive INT8 quantization because it applies CTranslate2's Whisper-specific optimization heuristics.
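As a point of reference, a one-time conversion along these lines can be done with CTranslate2's Transformers converter; the checkpoint name and output directory below are illustrative, and the converter expects the model in Hugging Face Transformers format:

    import ctranslate2.converters

    # One-time offline step: read the Whisper checkpoint in Hugging Face
    # Transformers format and emit CTranslate2's binary model directory,
    # quantizing weights to INT8.
    converter = ctranslate2.converters.TransformersConverter("openai/whisper-small")
    converter.convert("whisper-small-ct2", quantization="int8")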
multi-format audio transcription output with format conversion
Transcribes audio to text and automatically converts the output to multiple subtitle and text formats (JSON, VTT, SRT, TSV, TXT) via command-line flags. The implementation parses CTranslate2's segment-level output (which includes timestamps and confidence scores) and formats each into the target schema, handling edge cases like special characters, timing precision, and line-length constraints specific to each format.
Unique: Leverages CTranslate2's native segment-level output (which includes per-segment timestamps, confidence scores, and token-level information) to generate multiple output formats from a single inference pass, avoiding redundant re-processing. The implementation maps CTranslate2's internal segment structure directly to each format's schema without intermediate representations.
vs alternatives: Faster than post-processing transcripts with external tools (ffmpeg-python, pysrt) because conversion happens in-memory without file I/O, and more accurate than regex-based format conversion because it preserves CTranslate2's native timestamp precision.
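To illustrate the kind of per-format handling this involves, here is a small sketch of SRT emission; the segment dicts with start/end/text keys are an assumed schema, not necessarily this project's internal structure:

    def format_srt_time(seconds: float) -> str:
        # SRT timestamps are HH:MM:SS,mmm with a comma before milliseconds.
        ms = round(seconds * 1000)
        hours, ms = divmod(ms, 3_600_000)
        minutes, ms = divmod(ms, 60_000)
        secs, ms = divmod(ms, 1_000)
        return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

    def write_srt(segments, path):
        # segments: iterable of dicts with "start", "end" (seconds) and "text".
        with open(path, "w", encoding="utf-8") as f:
            for index, seg in enumerate(segments, start=1):
                f.write(f"{index}\n")
                f.write(f"{format_srt_time(seg['start'])} --> "
                        f"{format_srt_time(seg['end'])}\n")
                f.write(seg["text"].strip() + "\n\n")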
language detection and automatic model selection
Automatically detects the spoken language in audio using Whisper's multilingual model and selects the matching model variant (for example, the English-only .en checkpoints versus the multilingual ones) without requiring manual language specification. Detection uses the first 30 seconds of audio: after the encoder pass, Whisper's language tokens are scored in a single decoder step, and the winning token identifies the language and is fed to the decoder as part of its prompt.
Unique: Reuses Whisper's own multilingual training (covering 99 languages) to perform detection via the model's language tokens, without additional models or API calls, keeping the entire pipeline self-contained. Detection runs once, immediately after the encoder pass, and the result is cached to avoid redundant computation.
vs alternatives: Faster than separate language detection APIs (no network latency) and more accurate than heuristic-based detection (e.g., phoneme analysis) because it uses Whisper's native multilingual training.
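A sketch of the detection step using CTranslate2's public Whisper API; the model paths and the use of transformers' WhisperProcessor plus librosa for the log-mel features are assumptions, not this project's exact code:

    import ctranslate2
    import librosa
    from transformers import WhisperProcessor

    # Build the 30-second log-mel window Whisper's encoder expects;
    # WhisperProcessor pads or trims the audio to that window.
    processor = WhisperProcessor.from_pretrained("openai/whisper-small")
    audio, _ = librosa.load("sample.wav", sr=16000, mono=True)
    inputs = processor(audio, sampling_rate=16000, return_tensors="np")
    features = ctranslate2.StorageView.from_array(inputs.input_features)

    model = ctranslate2.models.Whisper("whisper-small-ct2")
    # detect_language returns, per batch item, (language token, probability)
    # pairs sorted by probability, e.g. ("<|en|>", 0.97).
    results = model.detect_language(features)
    language, probability = results[0][0]
    print(language, probability)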
batch audio processing with parallel inference
Processes multiple audio files sequentially or in parallel on CTranslate2's optimized runtime, with optional GPU acceleration. The CLI accepts a list of input files and runs each through the same model instance, reusing the loaded model in memory to avoid repeated model-loading overhead. CUDA GPU support is detected automatically and used if available; otherwise CTranslate2 falls back to its optimized CPU backends.
Unique: Keeps a single CTranslate2 model instance resident in memory, relying on its internal memory pooling and cached allocations to avoid reload overhead when processing multiple files in sequence. The architecture loads the model once, reuses the same inference session across files, and lets CTranslate2's internal GPU memory management handle batch processing without explicit parallelization code.
vs alternatives: More efficient than calling the original Whisper CLI in a loop (which reloads the model each time) and simpler than external parallelization frameworks because the model stays resident in memory across files.
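A minimal sketch of the load-once pattern under those assumptions; the feature batches would be built per file as in the detection sketch above, and the prompt tokens follow Whisper's task format:

    import ctranslate2

    def transcribe_batch(model_dir, feature_batches, prompt_tokens, beam_size=5):
        # Load the CTranslate2 model once; the same resident instance (and
        # its pooled CPU/GPU memory) is reused for every file.
        model = ctranslate2.models.Whisper(model_dir, device="auto")
        results = []
        for features in feature_batches:
            # features: a ctranslate2.StorageView of log-mel frames for one
            # file; prompt_tokens e.g. ["<|startoftranscript|>", "<|en|>",
            # "<|transcribe|>", "<|notimestamps|>"].
            results.append(model.generate(features, [prompt_tokens],
                                          beam_size=beam_size))
        return results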
cpu and gpu device selection with automatic fallback
Automatically detects available compute devices (CPU, CUDA GPU) and selects the optimal device for inference. If a GPU is unavailable or inference fails on it, the system falls back to CPU without user intervention. Device selection is configurable via the --device flag (cpu, cuda, auto), and CTranslate2 handles kernel selection and execution on the chosen device.
Unique: Delegates device detection and execution to CTranslate2's C++ runtime, which has native CUDA and CPU backends (including oneDNN and Apple Accelerate on CPU). The CLI wrapper simply passes the device flag through and relies on CTranslate2's internal device abstraction layer for dispatch and fallback logic, avoiding redundant device-detection code.
vs alternatives: More robust than manual device selection because CTranslate2's runtime handles device-specific optimizations (e.g., CUDA kernel selection and runtime CPU ISA dispatch such as AVX2/AVX-512) automatically, and simpler than frameworks that require explicit device context management (PyTorch, TensorFlow).
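A sketch of that fallback policy; the exception handling is an assumed wrapper-level behavior (CTranslate2 raises on an unusable device rather than silently switching), while "auto" is CTranslate2's own built-in selection:

    import ctranslate2

    def load_whisper(model_dir, device="auto"):
        # "auto" lets CTranslate2 pick CUDA when available, else CPU, with
        # no extra detection code in the wrapper.
        try:
            return ctranslate2.models.Whisper(model_dir, device=device)
        except (RuntimeError, ValueError):
            # Assumed wrapper policy: if an explicit GPU request fails
            # (driver mismatch, out of memory), retry on CPU instead of
            # aborting the transcription job.
            if device == "cuda":
                return ctranslate2.models.Whisper(model_dir, device="cpu")
            raise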