pyannote-audio
Repository · Free
State-of-the-art speaker diarization toolkit
Capabilities (12 decomposed)
end-to-end speaker diarization with neural segmentation
Medium confidence — Performs speaker diarization by combining pretrained neural segmentation models with speaker embedding extraction and clustering. The pipeline uses a two-stage approach: first, a temporal convolutional network (TCN) or transformer-based segmentation model identifies speaker boundaries and speech/non-speech regions frame-by-frame; second, speaker embeddings are extracted and clustered using agglomerative hierarchical clustering with dynamic threshold tuning. The system supports both batch processing and streaming inference modes.
Uses a modular pipeline architecture where segmentation and embedding extraction are decoupled, allowing users to swap pretrained models (e.g., from Hugging Face) and customize clustering thresholds per use case. Implements online/streaming diarization via frame-by-frame processing, unlike batch-only competitors.
Outperforms commercial solutions (Google Cloud Speech-to-Text, AWS Transcribe) on speaker boundary accuracy while remaining open-source and customizable; faster inference than ECAPA-TDNN baselines through optimized PyTorch implementations.
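A minimal usage sketch of the two-stage pipeline described above, assuming the gated "pyannote/speaker-diarization-3.1" checkpoint and a local `meeting.wav` file (both placeholders):

```python
# Sketch: load a pretrained diarization pipeline and iterate speaker turns.
# Model name, token handling, and the input file are assumptions; adapt them.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",  # gated checkpoint: accept the model terms first
)

diarization = pipeline("meeting.wav")  # returns a pyannote.core.Annotation

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```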
speaker embedding extraction with pretrained neural encoders
Medium confidence — Extracts fixed-dimensional speaker embeddings (typically 192-512 dims) from audio segments using pretrained speaker verification models (e.g., ECAPA-TDNN, ResNet-based architectures). The embeddings capture speaker-specific acoustic characteristics and are designed to be speaker-discriminative while invariant to spoken content. Embeddings can be extracted at segment or utterance level and are compatible with standard distance metrics (cosine, Euclidean) for downstream clustering or similarity matching.
Provides a modular embedding extraction API that decouples model architecture from inference, allowing users to load custom pretrained encoders from Hugging Face or define their own. Supports batch processing with automatic padding and efficient GPU utilization through PyTorch's native operations.
More flexible than closed-source APIs (Google Cloud Speaker ID, Azure Speaker Recognition) by allowing model swapping and local inference; produces embeddings compatible with standard clustering libraries (scikit-learn, scipy) without vendor lock-in.
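A hedged sketch of per-file embedding extraction and cosine comparison; the "pyannote/embedding" checkpoint and file names are placeholders:

```python
# Sketch: one embedding per file with a pretrained encoder, then cosine distance.
from pyannote.audio import Inference, Model
from scipy.spatial.distance import cdist

model = Model.from_pretrained("pyannote/embedding", use_auth_token="HF_TOKEN")
inference = Inference(model, window="whole")   # a single embedding for the whole file

emb_a = inference("speaker_a.wav").reshape(1, -1)   # placeholder files
emb_b = inference("speaker_b.wav").reshape(1, -1)

# Smaller cosine distance suggests the same speaker.
print(cdist(emb_a, emb_b, metric="cosine")[0, 0])
```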
visualization and debugging tools for diarization results
Medium confidence — Provides utilities for visualizing diarization results, including speaker timeline plots, embedding space visualizations (t-SNE, UMAP), and spectrogram overlays with speaker labels. Includes debugging tools for analyzing segmentation errors, embedding quality, and clustering decisions. Supports interactive HTML visualizations and static plots for reports. Can overlay ground truth annotations for error analysis.
Provides integrated visualization tools that work directly with diarization outputs (RTTM, embeddings) without requiring external tools. Supports both static (matplotlib) and interactive (plotly) backends, allowing users to choose based on use case.
More convenient than manual visualization using matplotlib; integrates error analysis and ground truth comparison directly into visualization tools; supports interactive exploration unlike static plot libraries.
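A minimal matplotlib sketch for a speaker timeline built from a diarization Annotation; this is a generic plotting pattern, not a specific pyannote-audio visualization API:

```python
# Sketch: draw one horizontal bar per speaker turn from a pyannote.core.Annotation.
import matplotlib.pyplot as plt

def plot_timeline(annotation, path="timeline.png"):
    speakers = sorted(annotation.labels())
    fig, ax = plt.subplots(figsize=(10, 0.6 * len(speakers) + 1))
    for turn, _, speaker in annotation.itertracks(yield_label=True):
        ax.barh(speakers.index(speaker), turn.end - turn.start,
                left=turn.start, height=0.6)
    ax.set_yticks(range(len(speakers)))
    ax.set_yticklabels(speakers)
    ax.set_xlabel("time (s)")
    fig.savefig(path, bbox_inches="tight")

# plot_timeline(diarization)  # e.g. the Annotation returned by the pipeline
```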
batch processing and pipeline orchestration for large audio collections
Medium confidence — Provides utilities for processing large collections of audio files in batches with automatic job scheduling, error handling, and result aggregation. Supports parallel processing across multiple CPU cores or GPUs, with configurable batch sizes and queue management. Includes checkpointing to resume interrupted jobs and logging for monitoring progress. Can be integrated with workflow orchestration tools (e.g., Airflow, Prefect) for production pipelines.
Provides a high-level batch processing API that abstracts away parallelization and error handling complexity. Includes checkpointing and resumable job execution, allowing users to process large collections without worrying about job failures.
Simpler than manual multiprocessing setup; integrates checkpointing and error handling natively; more flexible than cloud-based batch processing services by allowing local or on-premise execution.
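An illustrative, resumable batch loop (a generic pattern, not a specific pyannote-audio batch API): it writes one RTTM per input and skips files that already have output, so interrupted runs can resume:

```python
# Sketch: checkpointed batch processing over a directory of wav files.
from pathlib import Path

def run_batch(pipeline, audio_dir, out_dir):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(audio_dir).glob("*.wav")):
        rttm_path = out / f"{wav.stem}.rttm"
        if rttm_path.exists():            # checkpoint: skip finished files on resume
            continue
        try:
            diarization = pipeline(str(wav))
            with open(rttm_path, "w") as f:
                diarization.write_rttm(f)
        except RuntimeError as err:       # e.g. out-of-memory on a very long file
            print(f"failed on {wav.name}: {err}")
```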
temporal speaker segmentation with frame-level classification
Medium confidence — Performs frame-level speaker activity detection and speaker change detection using neural segmentation models (TCN or transformer-based) that process audio spectrograms and output per-frame probabilities for speech/non-speech and speaker boundaries. The model operates on fixed-size windows (typically 10-20ms frames) and uses temporal convolutions or attention mechanisms to capture context across frames. Outputs are post-processed (smoothing, peak detection) to produce clean segment boundaries.
Implements a modular segmentation pipeline where frame-level predictions are decoupled from post-processing, allowing users to apply custom smoothing, thresholding, or peak detection strategies. Supports both TCN and transformer-based architectures with configurable receptive fields for different temporal resolutions.
Provides frame-level granularity superior to segment-based approaches (e.g., WebRTC VAD), enabling precise speaker boundary detection; more accurate than rule-based methods (energy thresholding, spectral change detection) through learned representations.
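A hedged sketch of obtaining frame-level activity scores from a pretrained segmentation model; the checkpoint name is taken from the public model cards, and the exact output shape depends on library version and aggregation settings:

```python
# Sketch: sliding-window inference with a segmentation model.
from pyannote.audio import Inference, Model

model = Model.from_pretrained("pyannote/segmentation-3.0", use_auth_token="HF_TOKEN")
inference = Inference(model)           # sliding-window inference with model defaults

scores = inference("meeting.wav")      # placeholder input file
print(type(scores).__name__, scores.data.shape)   # frame-level speaker activity scores
```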
pretrained model management and loading from hugging face hub
Medium confidence — Provides a unified interface for discovering, downloading, and loading pretrained diarization and speaker embedding models from Hugging Face Model Hub. Models are versioned, cached locally, and can be instantiated with a single function call. The system handles model card parsing, dependency resolution, and automatic fallback to CPU if GPU is unavailable. Users can also upload custom models to Hugging Face Hub for sharing and reproducibility.
Integrates tightly with Hugging Face Hub's model versioning and caching system, allowing users to pin specific model versions via Git commit hashes. Provides a Python API that abstracts away Hub authentication and model instantiation complexity.
Simpler than manual model downloading and weight management; more flexible than monolithic model zoos by leveraging Hugging Face's distributed model hosting and community contributions.
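A hedged sketch of loading pinned checkpoints from the Hub; the "@<revision>" suffix has appeared in pyannote examples, but treat the model names and revisions here as placeholders:

```python
# Sketch: pin a model revision for reproducibility; weights are cached after first use.
from pyannote.audio import Model, Pipeline

model = Model.from_pretrained("pyannote/segmentation@Interspeech2021",
                              use_auth_token="HF_TOKEN")

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
                                    use_auth_token="HF_TOKEN")
```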
agglomerative hierarchical clustering with dynamic threshold tuning
Medium confidence — Clusters speaker embeddings using agglomerative hierarchical clustering (bottom-up merging) with dynamic threshold selection based on embedding statistics. The algorithm computes pairwise distances between embeddings (cosine or Euclidean), builds a dendrogram, and cuts at a threshold that maximizes cluster separation. Threshold tuning can be automatic (based on silhouette score, gap statistic) or manual. Supports custom linkage criteria (complete, average, ward) and distance metrics.
Implements dynamic threshold tuning that adapts to embedding statistics (e.g., median pairwise distance, silhouette score), reducing manual hyperparameter tuning. Supports custom linkage criteria and distance metrics, allowing users to experiment with different clustering strategies without reimplementing the algorithm.
More interpretable than k-means or spectral clustering (dendrogram visualization); more flexible than fixed-threshold approaches by automatically adapting to embedding distributions.
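A minimal scikit-learn sketch of threshold-based agglomerative clustering over speaker embeddings; the 0.7 threshold and the random embeddings are placeholders, and the `metric=` keyword requires scikit-learn ≥ 1.2 (older versions use `affinity=`):

```python
# Sketch: cut the dendrogram at a cosine-distance threshold instead of fixing k.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

embeddings = np.random.randn(40, 192)                    # stand-in for real embeddings
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.7,     # placeholder value, tune per embedding model
    metric="cosine",
    linkage="average",
)
labels = clustering.fit_predict(embeddings)
print("estimated speakers:", labels.max() + 1)
```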
streaming/online diarization with incremental speaker updates
Medium confidence — Performs speaker diarization on streaming audio by processing frames incrementally and updating speaker clusters in real-time. The system maintains a running set of speaker embeddings and updates cluster assignments as new frames arrive. Segmentation is performed frame-by-frame, and new speakers are detected by comparing incoming embeddings against existing speaker clusters using a dynamic threshold. Supports both online (single-pass) and semi-online (buffered) modes for latency/accuracy tradeoffs.
Implements a frame-by-frame processing pipeline with incremental embedding extraction and cluster updates, avoiding the need to reprocess entire audio files. Supports configurable buffer sizes and update frequencies, allowing users to trade off latency (smaller buffers) for accuracy (larger buffers).
Enables real-time diarization unlike batch-only approaches; lower latency than cloud-based APIs (Google Cloud, AWS) due to local processing; more accurate than simple voice activity detection + speaker identification baselines.
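An illustrative sketch of the incremental assignment idea (not the library's streaming API): each incoming chunk embedding is compared against running speaker centroids and either assigned or opens a new speaker:

```python
# Sketch: online speaker assignment against running centroids.
import numpy as np

def assign_speaker(embedding, centroids, threshold=0.7):
    """Return a speaker index, updating `centroids` in place."""
    emb = embedding / np.linalg.norm(embedding)
    if centroids:
        sims = [float(emb @ c) for c in centroids]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:                      # known speaker: update centroid
            centroids[best] = centroids[best] + emb
            centroids[best] /= np.linalg.norm(centroids[best])
            return best
    centroids.append(emb)                                # new speaker detected
    return len(centroids) - 1
```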
audio preprocessing and feature extraction (mel-spectrograms, mfccs)
Medium confidence — Provides utilities for converting raw audio waveforms into acoustic features (mel-spectrograms, MFCCs, chromagrams) required by neural models. Handles audio resampling, normalization, windowing, and feature computation using librosa or torchaudio backends. Supports both offline (batch) and online (streaming) feature extraction with configurable window sizes, hop lengths, and frequency ranges. Features are cached and can be reused across multiple model runs.
Provides a modular preprocessing API that supports both librosa and torchaudio backends, allowing users to choose between CPU-based (librosa) and GPU-accelerated (torchaudio) feature extraction. Includes caching and batching optimizations for efficient processing of large audio files.
More flexible than hardcoded preprocessing in monolithic models; supports both offline and streaming modes unlike batch-only feature extractors; GPU acceleration via torchaudio provides 10-100x speedup over CPU-based librosa.
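A hedged torchaudio sketch of mel-spectrogram and MFCC extraction; the window/hop sizes are typical values, not necessarily pyannote defaults:

```python
# Sketch: GPU-friendly feature extraction with torchaudio transforms.
import torchaudio
import torchaudio.transforms as T

waveform, sample_rate = torchaudio.load("meeting.wav")    # placeholder file

mel = T.MelSpectrogram(sample_rate=sample_rate, n_fft=400,
                       hop_length=160, n_mels=80)(waveform)
mfcc = T.MFCC(sample_rate=sample_rate, n_mfcc=19)(waveform)

print(mel.shape, mfcc.shape)   # (channels, n_mels, frames), (channels, n_mfcc, frames)
```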
rttm format i/o and annotation management
Medium confidence — Reads, writes, and manipulates speaker diarization annotations in RTTM (Rich Transcription Time Marked) format, a standard format for speaker diarization ground truth and predictions. Provides utilities for parsing RTTM files into Python objects, filtering/merging segments, computing metrics (DER, JER, purity, coverage), and exporting results back to RTTM. Supports validation of RTTM files and conversion between RTTM and other formats (JSON, CSV).
Provides a Pythonic API for RTTM manipulation with support for segment filtering, merging, and metric computation. Includes validation utilities to detect malformed RTTM files and conversion tools for interoperability with other annotation formats.
More convenient than manual RTTM parsing; integrates evaluation metrics (DER, JER) directly into the library, avoiding dependency on external evaluation scripts.
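A hedged sketch of RTTM round-tripping and DER computation with pyannote.database and pyannote.metrics; the file names and the "meeting" URI are placeholders:

```python
# Sketch: load reference and hypothesis RTTMs, score them, write a result back out.
from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate

reference = load_rttm("reference.rttm")["meeting"]     # dict keyed by file URI
hypothesis = load_rttm("hypothesis.rttm")["meeting"]

metric = DiarizationErrorRate()
print(f"DER = {metric(reference, hypothesis):.2%}")

with open("out.rttm", "w") as f:
    hypothesis.write_rttm(f)
```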
multi-gpu and distributed inference support
Medium confidence — Enables distributed inference across multiple GPUs or machines using PyTorch's distributed data parallel (DDP) and model parallel patterns. The system automatically partitions audio files across GPUs, processes segments in parallel, and aggregates results. Supports both data parallelism (same model on multiple GPUs) and model parallelism (large models split across GPUs). Handles synchronization and result merging transparently.
Leverages PyTorch's native distributed training utilities (torch.nn.parallel.DistributedDataParallel) to abstract away synchronization and communication complexity. Provides a high-level API that hides distributed setup details while allowing fine-grained control over parallelization strategy.
Simpler than manual distributed implementation using multiprocessing or Ray; integrates seamlessly with PyTorch ecosystem; more efficient than naive batching across GPUs due to optimized communication patterns.
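An illustrative sketch of simple data-parallel inference (one process per GPU, files sharded round-robin); this is a generic pattern rather than a specific pyannote-audio distributed API, and `Pipeline.to(...)` assumes a recent library version:

```python
# Sketch: shard files across GPUs with one worker process per device.
import torch
import torch.multiprocessing as mp
from pyannote.audio import Pipeline

def worker(rank, files):
    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
                                        use_auth_token="HF_TOKEN")
    pipeline.to(torch.device(f"cuda:{rank}"))
    for wav in files[rank::torch.cuda.device_count()]:   # round-robin shard
        diarization = pipeline(wav)
        with open(wav.replace(".wav", ".rttm"), "w") as f:
            diarization.write_rttm(f)

if __name__ == "__main__":
    files = ["a.wav", "b.wav", "c.wav"]                  # placeholder list
    mp.spawn(worker, args=(files,), nprocs=torch.cuda.device_count())
```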
custom model training and fine-tuning on user data
Medium confidence — Provides a training framework for fine-tuning pretrained diarization models on custom datasets or training models from scratch. Includes data loaders for RTTM-annotated audio, loss functions (e.g., focal loss for imbalanced data), optimization strategies (Adam, SGD with learning rate scheduling), and validation/evaluation loops. Supports mixed-precision training for memory efficiency and gradient accumulation for large batch sizes. Integrates with Weights & Biases for experiment tracking.
Provides a modular training framework with pluggable loss functions, optimizers, and data loaders, allowing users to customize training without reimplementing core logic. Integrates with Weights & Biases for automatic experiment tracking and model versioning.
More flexible than monolithic training scripts; supports mixed-precision training and gradient accumulation for efficient large-scale training; integrates experiment tracking natively, avoiding manual logging.
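A hedged sketch of the fine-tuning flow, following the pattern in the pyannote training tutorials; the protocol name and `database.yml` are hypothetical, and task/registry arguments vary across library versions:

```python
# Sketch: fine-tune a pretrained segmentation model on an RTTM-annotated corpus.
import pytorch_lightning as pl
from pyannote.audio import Model
from pyannote.audio.tasks import Segmentation
from pyannote.database import FileFinder, registry

registry.load_database("database.yml")                   # points at your audio + RTTM
protocol = registry.get_protocol(
    "MyCorpus.SpeakerDiarization.MyProtocol",            # hypothetical protocol name
    preprocessors={"audio": FileFinder()},
)

task = Segmentation(protocol, duration=5.0)              # 5 s training chunks
model = Model.from_pretrained("pyannote/segmentation-3.0", use_auth_token="HF_TOKEN")
model.task = task                                        # attach the fine-tuning task

trainer = pl.Trainer(max_epochs=10, accelerator="gpu", devices=1)
trainer.fit(model)
```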
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with pyannote-audio, ranked by overlap. Discovered automatically through the match graph.
speaker-diarization-3.1
automatic-speech-recognition model. 10,242,383 downloads.
speechbrain
All-in-one speech toolkit in pure Python and Pytorch
speaker-diarization-community-1
automatic-speech-recognition model. 2,216,403 downloads.
Deepgram
Enterprise speech AI with real-time transcription and speaker diarization.
Lugs
Accurately captions and transcribes all audio on your computer and...
Vibe Transcribe
All-in-one solution for effortless audio and video transcription. [#opensource](https://github.com/thewh1teagle/vibe)
Best For
- ✓Speech processing researchers and practitioners
- ✓Teams building meeting transcription or call center analytics systems
- ✓Developers creating speaker-aware audio analysis pipelines
- ✓Organizations processing large-scale audio archives for speaker identification
- ✓ML engineers building speaker identification or verification systems
- ✓Researchers experimenting with speaker embedding spaces and clustering algorithms
- ✓Teams integrating speaker analysis into larger audio processing pipelines
- ✓Developers needing speaker-agnostic representations for downstream tasks
Known Limitations
- ⚠Requires 8+ GB RAM for processing long audio files; memory usage scales with audio duration
- ⚠Clustering quality degrades with >10 speakers in a single file due to embedding space saturation
- ⚠No built-in speaker identification (matching speakers across files); only within-file diarization
- ⚠Inference latency is ~0.5-2x real-time depending on model size and hardware; GPU strongly recommended for production
- ⚠Pretrained models optimized for English and European languages; performance drops significantly on low-resource languages
- ⚠Embeddings are model-specific; switching models requires re-extracting all embeddings
Requirements
Input / Output
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
Package Details
About
State-of-the-art speaker diarization toolkit
Categories
Alternatives to pyannote-audio
This repository contains hand-curated resources for Prompt Engineering with a focus on Generative Pre-trained Transformer (GPT), ChatGPT, PaLM, etc.
World's first open-source, agentic video production system. 12 pipelines, 52 tools, 500+ agent skills. Turn your AI coding assistant into a full video production studio.