Efficient Inference Through Encoder Decoder Caching

1

Florence-2 (Model, 57/100)

via “efficient inference through encoder-decoder caching”

Microsoft's unified model for diverse vision tasks.

Unique: Implements encoder-decoder caching: the visual encoder output is computed once and reused across all decoder steps, which reduces redundant attention computation and enables 2-3x faster inference for variable-length outputs.

vs others: More efficient than non-cached inference, but with higher memory overhead than single-pass models; it trades memory usage for lower latency.
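The encode-once pattern described for this entry can be illustrated with a toy sketch. Everything here is a stand-in: `CountingEncoder`, `decode_step`, and `generate` are hypothetical names invented for illustration, not Florence-2's actual API, and the "features" are simple arithmetic rather than real visual embeddings. The only point is that the cached path calls the encoder once regardless of output length.

```python
from typing import List


class CountingEncoder:
    """Stand-in for an expensive visual encoder; counts how often it runs."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, image: List[float]) -> List[float]:
        self.calls += 1
        return [2.0 * pixel for pixel in image]  # toy "features"


def decode_step(features: List[float], prefix: List[int]) -> int:
    """Toy decoder step: the next token depends on the features and the prefix."""
    return (int(sum(features)) + len(prefix)) % 10


def generate(encoder: CountingEncoder, image: List[float],
             max_new_tokens: int, cache_encoder: bool = True) -> List[int]:
    tokens: List[int] = []
    # Cached path: run the encoder once, before the decoding loop starts.
    cached_features = encoder(image) if cache_encoder else None
    for _ in range(max_new_tokens):
        # Uncached path re-encodes the same image at every decoding step.
        features = cached_features if cache_encoder else encoder(image)
        tokens.append(decode_step(features, tokens))
    return tokens
```

Both paths produce the same tokens; the difference is that the uncached path runs the encoder once per generated token, so its cost grows with output length, while the cached path holds the features in memory for the whole generation.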

2

VibeVoice-Realtime-0.5B (Model, 49/100)

via “efficient transformer inference with kv-cache optimization”

Text-to-speech model. 1,152,993 downloads.

Unique: Applies KV-cache optimization specifically to streaming TTS inference, reducing per-token latency from ~200ms to ~20-50ms on consumer GPUs. Combines cache reuse with selective attention masking to maintain streaming properties while avoiding redundant computation.

vs others: Achieves real-time streaming latency comparable to specialized streaming TTS engines (e.g., Coqui, Piper) while maintaining the quality and flexibility of larger transformer-based models.
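To make the KV-cache idea concrete, here is a minimal single-head attention sketch in plain Python. It is not VibeVoice's implementation: `project`, `CachedAttention`, and `uncached_step` are hypothetical names, and the key and value "projections" are fixed scalings chosen only so the example runs without a model. It shows that caching keys and values gives the same attention output while projecting each token only once.

```python
import math
from typing import List

Vector = List[float]


def project(x: Vector, scale: float) -> Vector:
    """Toy linear projection (a real model would use learned weight matrices)."""
    return [scale * value for value in x]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over all keys and values."""
    factor = 1.0 / math.sqrt(len(query))
    scores = [factor * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(score - peak) for score in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


class CachedAttention:
    """Keeps past keys and values so each step projects only the new token."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []
        self.projections = 0  # number of tokens projected so far

    def step(self, x: Vector) -> Vector:
        self.keys.append(project(x, 0.5))
        self.values.append(project(x, 2.0))
        self.projections += 1
        return attend(x, self.keys, self.values)


def uncached_step(history: List[Vector]) -> Vector:
    """Recomputes keys and values for the whole history at every step."""
    keys = [project(h, 0.5) for h in history]
    values = [project(h, 2.0) for h in history]
    return attend(history[-1], keys, values)
```

Over a sequence of n tokens the cached version performs n key/value projections in total, whereas the uncached version performs 1 + 2 + ... + n. The cache grows by one key and one value per token, which is the memory cost of the lower per-token latency.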

3

infinity-emb (API, 37/100)

via “request-caching-embedding-deduplication”

Infinity is a high-throughput, low-latency REST API for serving text embeddings, reranking models, and CLIP.

Unique: Implements transparent request-level caching that deduplicates identical embedding requests before batch formation, reducing unnecessary GPU computation. Cache is keyed by input text hash and supports configurable TTL and size limits.

vs others: More efficient than application-level caching because it deduplicates at the inference layer; faster than vector database caching because it avoids network round-trips; simpler than distributed caching because it's built-in.
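The deduplicate-before-batching behaviour described above can be sketched as follows. This is an illustrative stand-alone class, not infinity-emb's code or configuration surface: `EmbeddingCache`, `embed_batch`, `max_size`, and `ttl_seconds` are assumed names. Texts are keyed by a SHA-256 hash, entries expire after a TTL, and the oldest entries are evicted once the size limit is exceeded.

```python
import hashlib
import time
from collections import OrderedDict
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]


class EmbeddingCache:
    def __init__(self, embed_batch: Callable[[List[str]], List[Vector]],
                 max_size: int = 1024, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._embed_batch = embed_batch
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: "OrderedDict[str, Tuple[float, Vector]]" = OrderedDict()

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def _lookup(self, key: str) -> Optional[Vector]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, vector = entry
        if self._clock() - stored_at > self._ttl:
            del self._store[key]  # expired entry
            return None
        self._store.move_to_end(key)  # mark as recently used
        return vector

    def embed(self, texts: List[str]) -> List[Vector]:
        keys = [self._key(text) for text in texts]
        results: Dict[str, Vector] = {}
        pending: Dict[str, str] = {}  # unique cache misses, in first-seen order
        for key, text in zip(keys, texts):
            if key in results or key in pending:
                continue  # duplicate within this request
            cached = self._lookup(key)
            if cached is not None:
                results[key] = cached
            else:
                pending[key] = text
        if pending:
            # Only unique, uncached texts reach the model as one batch.
            vectors = self._embed_batch(list(pending.values()))
            now = self._clock()
            for key, vector in zip(pending, vectors):
                results[key] = vector
                self._store[key] = (now, vector)
                self._store.move_to_end(key)
            while len(self._store) > self._max_size:
                self._store.popitem(last=False)  # evict least recently used
        return [results[key] for key in keys]
```

Passing the clock in as a parameter is a design choice made for this sketch so that expiry can be exercised without sleeping; a production cache would also need locking if it is shared across threads.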

4

open-clip-torch (Repository, 27/100)

via “embedding caching and efficient batch inference”

Open reproduction of contrastive language-image pretraining (CLIP) and related.

Unique: Implements transparent embedding caching with optional disk persistence, allowing practitioners to trade memory for speed without modifying inference code, and supporting both in-memory and external vector database backends.

vs others: More efficient than recomputing embeddings repeatedly because it caches results transparently, but requires careful cache management and invalidation strategies for production systems.
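As a rough picture of an embedding cache with optional disk persistence, the sketch below wraps any embedding function in a two-level cache (memory first, then JSON files on disk). It is a generic pattern, not code from open-clip-torch: `PersistentEmbeddingCache`, `compute`, and `cache_dir` are hypothetical names, and a real setup would likely store tensors in a binary format rather than JSON.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional

Vector = List[float]


class PersistentEmbeddingCache:
    def __init__(self, compute: Callable[[str], Vector],
                 cache_dir: Optional[str] = None) -> None:
        self._compute = compute
        self._memory: Dict[str, Vector] = {}
        # Disk persistence is optional: without cache_dir only memory is used.
        self._dir = Path(cache_dir) if cache_dir is not None else None
        if self._dir is not None:
            self._dir.mkdir(parents=True, exist_ok=True)

    def get(self, item: str) -> Vector:
        key = hashlib.sha256(item.encode("utf-8")).hexdigest()
        if key in self._memory:
            return self._memory[key]  # fastest path: already in memory
        path = self._dir / f"{key}.json" if self._dir is not None else None
        if path is not None and path.exists():
            vector = json.loads(path.read_text())  # survives process restarts
        else:
            vector = self._compute(item)  # cache miss: run the model
            if path is not None:
                path.write_text(json.dumps(vector))
        self._memory[key] = vector
        return vector
```

Note that nothing here invalidates entries: if the model weights or preprocessing change, the cached files become stale, so one option is to include a model identifier in the hashed key.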

5

tiktoken (Repository, 22/100)

via “efficient in-memory encoding caching”

tiktoken is a fast BPE tokeniser for use with OpenAI's models

Unique: Implements a transparent, thread-safe singleton cache for encoding files that lazy-loads them on first use and prevents redundant downloads or file I/O. Developers don't need to manage the cache lifecycle manually; the library handles it.

vs others: More efficient than reloading encodings on every tokenization call because it caches loaded data in memory and uses a singleton pattern to avoid duplicate instances across the application.
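A thread-safe, lazily populated cache of this kind can be sketched with a lock and a dictionary. This is an illustration of the general pattern rather than tiktoken's source: `LazyRegistry` and its `loader` argument are hypothetical, and the loader stands in for whatever reads or downloads an encoding file.

```python
import threading
from typing import Callable, Dict


class LazyRegistry:
    """Loads each named resource at most once, even under concurrent access."""

    def __init__(self, loader: Callable[[str], dict]) -> None:
        self._loader = loader
        self._loaded: Dict[str, dict] = {}
        self._lock = threading.Lock()

    def get(self, name: str) -> dict:
        # Fast path: already loaded, no lock needed.
        table = self._loaded.get(name)
        if table is not None:
            return table
        with self._lock:
            # Re-check inside the lock: another thread may have loaded it
            # while this one was waiting.
            table = self._loaded.get(name)
            if table is None:
                table = self._loader(name)  # expensive I/O happens once
                self._loaded[name] = table
            return table
```

Creating one module-level instance of such a registry gives the singleton behaviour: every caller in the process shares the same loaded object, and the first caller pays the loading cost.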
