Capability
First Draft Acceleration
3 artifacts provide this capability.
Top Matches
via “speculative decoding with draft model acceleration”
A high-throughput and memory-efficient inference and serving engine for LLMs
Unique: Implements speculative decoding based on rejection sampling, with support for external draft model servers and variable draft sizes; most alternatives use a fixed draft model or require architectural compatibility between the draft and target models
vs others: Achieves a 2-3x latency reduction with minimal quality loss vs. naive beam search, and supports heterogeneous draft models vs. Medusa's approach of attaching extra decoding heads to the target model
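To make the rejection-sampling step concrete, here is a minimal Python sketch of the standard speculative sampling acceptance rule. It is an illustration under stated assumptions, not the engine's actual code or API: the names `speculative_accept` and `sample_from` are hypothetical, and token distributions are represented as plain dicts mapping token id to probability. Each drafted token is accepted with probability min(1, p/q), where p is the target model's probability and q is the draft model's. On the first rejection, a replacement is drawn from the normalized residual max(0, p - q) and the step ends. If every draft token is accepted, one extra token is sampled from the target model.

```python
import random


def sample_from(weights, rng):
    """Sample a key from a dict of non-negative weights (hypothetical helper)."""
    total = sum(weights.values())
    r = rng.random() * total
    acc = 0.0
    last = None
    for token, w in weights.items():
        if w <= 0.0:
            continue
        acc += w
        last = token
        if r < acc:
            return token
    return last  # guards against floating-point round-off


def speculative_accept(draft_tokens, draft_probs, target_probs, rng=random):
    """Run one speculation step and return the tokens to emit.

    draft_tokens: k token ids proposed by the draft model
    draft_probs:  k dicts, draft distribution q_i at each position
    target_probs: k + 1 dicts, target distribution p_i at each position
    """
    emitted = []
    for i, token in enumerate(draft_tokens):
        p = target_probs[i].get(token, 0.0)
        q = draft_probs[i].get(token, 0.0)
        # Accept the drafted token with probability min(1, p / q).
        if q > 0.0 and rng.random() < min(1.0, p / q):
            emitted.append(token)
            continue
        # Rejected: resample from the residual max(0, p - q).
        residual = {
            t: max(0.0, pt - draft_probs[i].get(t, 0.0))
            for t, pt in target_probs[i].items()
        }
        if sum(residual.values()) <= 0.0:
            residual = target_probs[i]  # degenerate case: fall back to p
        emitted.append(sample_from(residual, rng))
        return emitted
    # Every draft token was accepted: take one bonus token from the target.
    emitted.append(sample_from(target_probs[len(draft_tokens)], rng))
    return emitted
```

Because the number of drafted tokens k is just the length of `draft_tokens`, the same routine handles variable draft sizes, and it only needs the draft model's probabilities, not its architecture, which is why a heterogeneous or externally served draft model can be used in this scheme.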