Deepgram API
API · Free
Speech-to-text API — Nova-2, real-time streaming, diarization, sentiment, 36+ languages.
Capabilities (15 decomposed)
real-time streaming speech-to-text with ultra-low latency voice agent optimization
Medium confidence: Processes live audio streams over WebSocket (WSS) using the Flux model, which includes built-in turn detection and interruption handling optimized for voice agent interactions. Audio is transcribed with sub-100 ms latency, enabling natural conversational flow without perceptible delays. The Flux model automatically detects speaker turns and handles mid-sentence interruptions, reducing the need for external turn-taking logic in voice agent applications.
Flux model includes native turn detection and interruption handling at the model level, eliminating the need for separate silence detection or heuristic-based turn-taking logic. This is built into the inference pipeline rather than post-processing transcripts.
Faster than stitching separate STT + silence detection + LLM orchestration because turn detection is native to the model, reducing latency and complexity in voice agent architectures.
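A minimal streaming sketch under stated assumptions: it uses the third-party websockets library against Deepgram's documented wss://api.deepgram.com/v1/listen endpoint; the model=flux parameter value and audio settings are assumptions to verify against the Flux documentation.

```python
# Minimal live-streaming sketch (assumptions noted in the lead-in).
import asyncio
import json
import websockets  # pip install websockets

API_KEY = "YOUR_DEEPGRAM_API_KEY"  # placeholder
URL = ("wss://api.deepgram.com/v1/listen"
       "?model=flux&encoding=linear16&sample_rate=16000")  # model name assumed

async def stream(chunks):
    """Send raw 16-bit PCM chunks; print transcripts as they arrive."""
    async with websockets.connect(
        URL,
        extra_headers={"Authorization": f"Token {API_KEY}"},  # `additional_headers` in websockets >= 14
    ) as ws:
        async def sender():
            for chunk in chunks:  # e.g. 20 ms microphone frames
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))  # documented close message

        async def receiver():
            async for message in ws:
                event = json.loads(message)
                # Streaming results arrive under channel.alternatives[0].
                alt = (event.get("channel", {}).get("alternatives") or [{}])[0]
                if alt.get("transcript"):
                    print(alt["transcript"])

        await asyncio.gather(sender(), receiver())

# asyncio.run(stream(my_pcm_chunks))  # my_pcm_chunks: an iterable of bytes
```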
batch speech-to-text transcription with high-accuracy timestamps and keyword boosting
Medium confidence: Accepts pre-recorded audio files via REST API and transcribes them using Nova-3 (monolingual or multilingual) or Enhanced/Base models, returning full transcripts with word-level timestamps and optional keyword boosting via keyterm prompting. Processing is synchronous (response includes full transcript) or can be polled asynchronously. Supports automatic language detection across 45+ languages, with optional language specification to improve accuracy.
Keyterm prompting is implemented as a pre-processing hint to the model, allowing domain-specific vocabulary to be weighted during inference rather than post-processing. This improves accuracy for specialized terms without requiring custom model training.
More accurate than generic STT for domain-specific content because keyterm prompting integrates with the model's inference, whereas competitors often rely on post-processing or require custom model fine-tuning.
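A hedged batch sketch using requests: the https://api.deepgram.com/v1/listen endpoint and Token auth scheme are documented, while the keyterm parameter name follows the Nova-3 keyterm prompting described above and should be verified against current docs.

```python
import requests  # pip install requests

API_KEY = "YOUR_DEEPGRAM_API_KEY"  # placeholder

def transcribe_file(path: str) -> str:
    with open(path, "rb") as audio:
        resp = requests.post(
            "https://api.deepgram.com/v1/listen",
            params={
                "model": "nova-3",
                # Repeated keyterm params boost domain-specific vocabulary.
                "keyterm": ["diarization", "Deepgram"],
            },
            headers={"Authorization": f"Token {API_KEY}",
                     "Content-Type": "audio/wav"},
            data=audio,  # streams the file body
        )
    resp.raise_for_status()
    body = resp.json()
    # Word-level timestamps ride along in the same alternatives object.
    return body["results"]["channels"][0]["alternatives"][0]["transcript"]

print(transcribe_file("meeting.wav"))
```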
cli tool with 28 api commands and mcp server integration
Medium confidence: Command-line interface for Deepgram API with 28 built-in commands for common tasks (transcription, synthesis, etc.). Includes a Model Context Protocol (MCP) server, enabling integration with AI coding tools and agents (e.g., Claude, Cursor). Allows developers to use Deepgram capabilities directly from the terminal or from AI assistants without writing code.
Includes both a traditional CLI (28 commands) and an MCP server, enabling integration with AI coding assistants without requiring code. MCP server allows Claude or other AI tools to call Deepgram capabilities directly.
More accessible than API-only solutions because CLI enables quick testing and scripting, while MCP integration allows AI assistants to use Deepgram without custom integration code.
concurrency-based rate limiting with tier-specific quotas
Medium confidence: Rate limiting is enforced via concurrent connection limits rather than requests-per-second or tokens-per-minute. Different tiers have different concurrency limits: Free (50 REST STT, 150 WSS STT, 45 TTS, 10 Audio Intelligence), Growth (50 REST STT, 225 WSS STT, 60 TTS, 10 Audio Intelligence), Enterprise (custom). Concurrency is tracked per API key and enforced at the connection level.
Uses concurrency-based rate limiting (concurrent connections) rather than request-based (requests/sec) or token-based (tokens/min) limits. This is more suitable for streaming and long-lived connections but requires different capacity planning.
Better suited for streaming and voice agent workloads than request-based rate limiting because it allows long-lived WebSocket connections without penalizing duration, but requires understanding concurrent load patterns.
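Client-side, a concurrency quota calls for a semaphore rather than a request-rate limiter. A minimal sketch; submit_to_deepgram is a hypothetical stand-in for a real API call:

```python
import asyncio

REST_STT_LIMIT = 50  # Free/Growth REST STT quota from the tiers above
sem = asyncio.Semaphore(REST_STT_LIMIT)

async def submit_to_deepgram(job: str) -> str:
    """Hypothetical stand-in for an actual async transcription request."""
    await asyncio.sleep(0.1)
    return f"transcript for {job}"

async def transcribe_with_limit(job: str) -> str:
    async with sem:  # queues locally once 50 requests are in flight
        return await submit_to_deepgram(job)

async def run_batch(jobs):
    return await asyncio.gather(*(transcribe_with_limit(j) for j in jobs))

print(asyncio.run(run_batch([f"file_{i}.wav" for i in range(200)])))
```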
free tier with $200 credit and no expiration
Medium confidence: Deepgram offers a free tier with $200 in API credits that never expire, no credit card required. Credits can be used across all products (STT, TTS, Audio Intelligence) subject to concurrency limits (50 REST STT, 150 WSS STT, 45 TTS, 10 Audio Intelligence). Free tier is suitable for development, testing, and small-scale production use.
Free tier includes $200 in credits with no expiration date and no credit card required, making it one of the most generous free tiers for voice APIs. Credits apply to all products, not just STT.
More generous than competitors' free tiers (e.g., Google Cloud Speech-to-Text, AWS Transcribe) because credits don't expire and no credit card is required, lowering barriers to entry for developers.
growth tier with 15-20% savings via annual pre-paid credits
Medium confidence: Growth tier offers annual pre-paid credits at a 15-20% discount compared to pay-as-you-go pricing. Minimum commitment is $4K/year. Credits are consumed as audio is processed; whether unused credits expire at year end is not documented, though expiration is standard for pre-paid models. Includes higher concurrency limits than the free tier (225 WSS STT vs 150, 60 TTS vs 45).
Offers 15-20% discount for annual pre-paid credits, with higher concurrency limits than free tier. Minimum $4K/year commitment positions this tier for growing applications with predictable workloads.
Better cost structure than pay-as-you-go for predictable workloads, but less flexible than competitors offering monthly commitments or no minimum spend.
enterprise tier with custom concurrency and pricing
Medium confidence: Enterprise tier offers custom concurrency limits, custom pricing, and dedicated support. Suitable for large-scale deployments, mission-critical applications, or organizations with specific compliance requirements (SOC2, HIPAA, GDPR). Requires contacting sales for pricing and terms.
Offers fully custom concurrency limits, pricing, and support, allowing enterprises to negotiate terms based on their specific scale and compliance requirements. Likely includes on-premise or self-hosted options.
Provides the flexibility and compliance guarantees required by large enterprises, but requires sales engagement and lacks transparent pricing compared to competitors with published enterprise pricing.
speaker diarization and multi-speaker attribution
Medium confidence: Automatically detects and labels multiple speakers in audio, attributing each transcript segment to the correct speaker using speaker diarization algorithms. Works with both real-time streaming (via Flux model with turn detection) and batch processing (via Nova-3 and other models). Returns transcript segments tagged with speaker IDs (e.g., Speaker 1, Speaker 2) and optionally speaker change boundaries with timestamps.
Diarization is built into the STT models (Flux, Nova-3) as a native capability, not a post-processing step. This allows real-time speaker detection during streaming and reduces latency compared to separate diarization pipelines.
Integrated into the transcription model rather than applied as a separate post-processing step, reducing latency and improving accuracy by leveraging acoustic context during inference.
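A sketch of reading speaker labels from a diarized batch response; the diarize=true flag and the per-word speaker field follow Deepgram's documented response shape, but the exact nesting should be verified:

```python
import requests

API_KEY = "YOUR_DEEPGRAM_API_KEY"  # placeholder

def speaker_turns(path: str):
    with open(path, "rb") as audio:
        resp = requests.post(
            "https://api.deepgram.com/v1/listen",
            params={"model": "nova-3", "diarize": "true", "punctuate": "true"},
            headers={"Authorization": f"Token {API_KEY}",
                     "Content-Type": "audio/wav"},
            data=audio,
        )
    resp.raise_for_status()
    words = resp.json()["results"]["channels"][0]["alternatives"][0]["words"]

    # Fold consecutive same-speaker words into readable turns.
    turns, current, speaker = [], [], None
    for w in words:
        if w.get("speaker") != speaker and current:
            turns.append((speaker, " ".join(current)))
            current = []
        speaker = w.get("speaker")
        current.append(w["word"])
    if current:
        turns.append((speaker, " ".join(current)))
    return turns  # e.g. [(0, "hello there"), (1, "hi thanks for calling")]
```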
unified voice agent orchestration with stt, llm routing, and tts synthesis
Medium confidence: The Voice Agent API combines speech-to-text, LLM orchestration, and text-to-speech into a single WebSocket endpoint, eliminating the need to stitch together separate services. Developers define a business logic handler that receives transcribed user input and returns text to be synthesized back to the user. The platform handles audio I/O, turn detection (via Flux), and state management across the conversation lifecycle.
Single WebSocket endpoint handles the full voice agent lifecycle (STT → LLM → TTS) with built-in turn detection and interruption handling, reducing the number of service integrations and network round-trips compared to stitching separate APIs.
Simpler and lower-latency than orchestrating separate STT, LLM, and TTS services because it eliminates inter-service communication and manages state internally, making it ideal for teams without dedicated voice infrastructure.
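A heavily hedged sketch of a Voice Agent session: the agent endpoint URL, the Settings message type, and the listen/think/speak field names are all assumptions modeled on the single-endpoint design described above; consult the Voice Agent API reference before relying on any of them.

```python
import asyncio
import json
import websockets

API_KEY = "YOUR_DEEPGRAM_API_KEY"                # placeholder
AGENT_URL = "wss://agent.deepgram.com/v1/agent"  # assumed endpoint

def play(pcm: bytes) -> None:
    """Hypothetical stand-in for an audio-output callback."""

async def run_agent(audio_frames):
    async with websockets.connect(
        AGENT_URL, extra_headers={"Authorization": f"Token {API_KEY}"}
    ) as ws:
        # One configuration message wires up the whole STT -> LLM -> TTS
        # pipeline; every field name below is an assumption.
        await ws.send(json.dumps({
            "type": "Settings",
            "agent": {
                "listen": {"model": "flux"},
                "think": {"provider": "open_ai", "model": "gpt-4o-mini"},
                "speak": {"model": "aura-2"},
            },
        }))
        for frame in audio_frames:      # caller audio in
            await ws.send(frame)
        async for message in ws:
            if isinstance(message, bytes):
                play(message)           # synthesized agent audio out
            else:
                print(json.loads(message))  # lifecycle/transcript events
```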
text-to-speech synthesis with streaming and batch modes
Medium confidence: Converts text to natural-sounding speech via REST API or WebSocket, supporting both single-request synthesis and continuous streaming of text chunks. Supports multiple voices and languages (exact count not documented). Can be used standalone or as part of the Voice Agent API. Streaming mode allows real-time audio playback as text is generated, reducing perceived latency in interactive applications.
Supports both REST (batch) and WebSocket (streaming) modes, allowing developers to choose between simplicity (REST) and low-latency interactivity (WebSocket streaming). Streaming mode enables real-time audio playback without waiting for full synthesis.
Streaming TTS via WebSocket reduces perceived latency compared to batch REST APIs because audio playback can begin before synthesis completes, improving user experience in interactive voice applications.
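A batch TTS sketch via REST; the /v1/speak endpoint and the {"text": ...} JSON body follow Deepgram's documented pattern, while the specific voice identifier is an assumption.

```python
import requests

API_KEY = "YOUR_DEEPGRAM_API_KEY"  # placeholder

def synthesize(text: str, out_path: str = "speech.mp3") -> None:
    resp = requests.post(
        "https://api.deepgram.com/v1/speak",
        params={"model": "aura-asteria-en"},  # assumed voice identifier
        headers={"Authorization": f"Token {API_KEY}"},
        json={"text": text},
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)  # audio bytes (MP3 by default)

synthesize("Hello from a text-to-speech sketch.")
```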
audio intelligence extraction (sentiment, topics, summarization)
Medium confidence: Analyzes transcribed or raw audio to extract metadata including sentiment analysis, topic detection, and automatic summarization. Operates via REST API on pre-recorded audio or transcripts. Returns structured data with sentiment labels (positive/negative/neutral), detected topics/themes, and abstractive summaries. Implementation details are not documented; likely uses post-processing on transcripts or parallel audio analysis.
Combines sentiment, topic detection, and summarization into a single Audio Intelligence endpoint, allowing batch analysis of multiple metadata types without separate API calls. Concurrency limits (10) suggest this is a resource-intensive operation.
Integrated audio analysis reduces the need to send transcripts to separate NLP services, keeping audio data within Deepgram's infrastructure and potentially improving latency for teams already using Deepgram for STT.
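A sketch requesting several Audio Intelligence features in one pre-recorded call; the summarize, topics, and sentiment flags match Deepgram's documented feature names, but the response nesting shown is an assumption to verify.

```python
import requests

API_KEY = "YOUR_DEEPGRAM_API_KEY"  # placeholder

with open("support_call.wav", "rb") as audio:
    resp = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "nova-3", "summarize": "v2",
                "topics": "true", "sentiment": "true"},
        headers={"Authorization": f"Token {API_KEY}",
                 "Content-Type": "audio/wav"},
        data=audio,
    )
resp.raise_for_status()
results = resp.json()["results"]
# Summary, topic, and sentiment payloads sit alongside the transcript;
# confirm the exact field paths against the current response schema.
print(results.get("summary", {}).get("short"))
```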
automatic language detection and multilingual transcription
Medium confidence: Automatically detects the language of incoming audio and transcribes it using the appropriate language model. Supports 45+ languages for STT. Developers can optionally specify a language to improve accuracy or force a specific language model. Language detection is performed at the model level during inference, not as a separate preprocessing step.
Language detection is performed at the model inference level, not as a separate preprocessing step, allowing simultaneous detection and transcription in a single pass. This reduces latency compared to two-stage pipelines.
Faster than separate language detection + transcription pipelines because detection and transcription occur in a single model pass, reducing latency and API calls for multilingual applications.
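A sketch enabling detection and reading the result back; detect_language and the per-channel detected_language field follow the documented schema, though model support for detection varies and should be checked.

```python
import requests

API_KEY = "YOUR_DEEPGRAM_API_KEY"  # placeholder

with open("unknown_language.wav", "rb") as audio:
    resp = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "nova-2", "detect_language": "true"},
        headers={"Authorization": f"Token {API_KEY}",
                 "Content-Type": "audio/wav"},
        data=audio,
    )
resp.raise_for_status()
channel = resp.json()["results"]["channels"][0]
print(channel.get("detected_language"),              # e.g. "es"
      channel["alternatives"][0]["transcript"])
```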
smart formatting and readability optimization
Medium confidence: Post-processes raw transcripts to improve readability by adding punctuation, capitalization, and formatting (e.g., converting numbers to words, formatting currency). Applied automatically during transcription or as an optional post-processing step. Improves transcript quality without requiring manual editing or custom grammar rules.
Smart formatting is applied as a post-processing step on the transcript, not during audio inference, allowing it to be toggled on/off without re-transcribing. Uses rule-based formatting rather than ML-based approaches.
Reduces manual editing time compared to raw transcripts, but less flexible than custom formatting rules; best for standard use cases where default formatting rules apply.
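Because the flag is a per-request toggle, switching formatting on or off is a one-parameter change; a tiny illustration (the formatted output shown is illustrative, not captured from the API):

```python
# Same audio, two requests; only the smart_format flag differs.
params_raw = {"model": "nova-3", "smart_format": "false"}
params_fmt = {"model": "nova-3", "smart_format": "true"}
# raw:       "meeting on january fifth costs three hundred dollars"
# formatted: "Meeting on January 5th costs $300."  (illustrative)
```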
high-accuracy timestamps with word-level timing
Medium confidence: Returns precise start and end timestamps for each word in the transcript, enabling synchronization with video, highlighting, or interactive playback. Timestamps are generated during inference and included in the JSON response. Accuracy is highest with the Enhanced and Base models ($0.0165/min and $0.0145/min respectively), which are optimized for timing precision.
Word-level timestamps are generated during inference and included in the base response, not as a separate post-processing step. Enhanced and Base models are specifically optimized for timing accuracy, allowing developers to trade cost for precision.
More accurate than post-processing timestamps from raw transcripts because timing is computed during inference with full acoustic context, enabling reliable video synchronization and interactive playback.
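A sketch that prints word timings for caption alignment; the per-word start/end fields (in seconds) follow Deepgram's documented response shape.

```python
import requests

API_KEY = "YOUR_DEEPGRAM_API_KEY"  # placeholder

with open("lecture.wav", "rb") as audio:
    resp = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "enhanced"},  # timing-optimized tier noted above
        headers={"Authorization": f"Token {API_KEY}",
                 "Content-Type": "audio/wav"},
        data=audio,
    )
resp.raise_for_status()
for w in resp.json()["results"]["channels"][0]["alternatives"][0]["words"]:
    print(f'{w["start"]:7.2f}s-{w["end"]:7.2f}s  {w["word"]}')
```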
multi-sdk integration with native language bindings
Medium confidence: Provides native SDKs for Python, JavaScript (Node.js/Browser), Go, .NET (C#), and Java, each with language-idiomatic APIs and error handling. SDKs abstract away HTTP/WebSocket details and provide convenience methods for common tasks (e.g., transcribe_file, stream_audio). Maturity and version information not documented.
Native SDKs for 5 major languages with language-idiomatic APIs (e.g., Python dataclasses, JavaScript Promises) rather than generic REST wrappers. Each SDK abstracts protocol details (HTTP vs WebSocket) and provides convenience methods.
More developer-friendly than raw REST/WebSocket APIs because SDKs provide language-native abstractions, error handling, and convenience methods, reducing boilerplate and integration time.
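A hedged sketch against the v3-era Python SDK; the class and method names follow the SDK's published examples but shift between versions, so treat them as assumptions.

```python
# pip install deepgram-sdk
from deepgram import DeepgramClient, PrerecordedOptions

deepgram = DeepgramClient("YOUR_DEEPGRAM_API_KEY")  # placeholder key

with open("audio.wav", "rb") as f:
    source = {"buffer": f.read()}

options = PrerecordedOptions(model="nova-3", smart_format=True, diarize=True)
# v3 call shape; newer SDK versions expose this as `listen.rest`.
response = deepgram.listen.prerecorded.v("1").transcribe_file(source, options)
print(response.results.channels[0].alternatives[0].transcript)
```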
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with Deepgram API, ranked by overlap. Discovered automatically through the match graph.
Speechmatics
Autonomous speech recognition with industry-leading multilingual accuracy.
Coqui
Generative AI for Voice.
@modelcontextprotocol/server-transcript
MCP App Server for live speech transcription
AssemblyAI API
Speech-to-text with intelligence — Universal-2, summarization, PII redaction, LeMUR for audio LLM.
Pollinations
Multimodal MCP server for generating images, audio, and text with no authentication required.
MiniMax-MCP
Official MiniMax Model Context Protocol (MCP) server that enables interaction with powerful Text to Speech, image generation and video generation APIs.
Best For
- ✓Voice agent developers building conversational AI systems
- ✓Real-time transcription services (live meetings, customer support calls)
- ✓Teams building low-latency voice interfaces
- ✓Content creators and media companies processing recorded audio
- ✓Legal/compliance teams transcribing depositions or interviews
- ✓Researchers analyzing speech data with precise timing requirements
- ✓Teams with domain-specific vocabulary (medical, legal, technical)
- ✓Developers prototyping voice features quickly
Known Limitations
- ⚠Flux model is STT-only; does not include sentiment analysis or topic detection
- ⚠WebSocket connections require persistent network; no automatic reconnection logic documented
- ⚠Turn detection is model-based; may require tuning for non-English languages or accented speech
- ⚠Concurrency limits vary by tier (150 WSS connections on Free, 225 on Growth)
- ⚠REST API is synchronous; no documented async/webhook support for very large files
- ⚠Max file size and duration limits not specified in documentation
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
AI speech-to-text and text-to-speech API. Nova-2 model with industry-leading accuracy. Features real-time streaming, speaker diarization, sentiment analysis, topic detection, and summarization. Supports 36+ languages.
Alternatives to Deepgram API
Hand-curated resources for Prompt Engineering with a focus on Generative Pre-trained Transformer (GPT), ChatGPT, PaLM, etc.
World's first open-source, agentic video production system. 12 pipelines, 52 tools, 500+ agent skills. Turn your AI coding assistant into a full video production studio.