Efficient Inference With Model Distillation

1

DeepSpeed | Framework | 60/100

via “model compression through pruning and distillation”

Microsoft's distributed training library — ZeRO optimizer, trillion-parameter scale, RLHF.

Unique: Combines structured pruning with knowledge distillation; supports both unstructured and structured sparsity patterns with automatic fine-tuning to recover accuracy

vs others: More integrated than separate pruning/distillation tools; automatic fine-tuning reduces manual tuning effort
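The difference between unstructured and structured sparsity mentioned above can be illustrated with a small, library-free sketch. This is not DeepSpeed's API; the function names `magnitude_prune` and `prune_rows` and the toy matrix `W` are assumptions made purely for illustration. Unstructured pruning zeroes individual low-magnitude weights, while structured pruning removes whole rows (for example, entire neurons).

```python
def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero the smallest-magnitude fraction of weights."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)  # number of weights to zero out
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]  # k-th smallest magnitude
    pruned, dropped = [], 0
    for row in weights:
        new_row = []
        for w in row:
            # The counter keeps exactly k zeros even when magnitudes tie.
            if abs(w) <= threshold and dropped < k:
                new_row.append(0.0)
                dropped += 1
            else:
                new_row.append(w)
        pruned.append(new_row)
    return pruned


def prune_rows(weights, keep):
    """Structured pruning: keep the `keep` rows with the largest squared L2 norm."""
    norms = [sum(w * w for w in row) for row in weights]
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original row order
    return [weights[i][:] for i in kept]


# Toy weight matrix (hypothetical values).
W = [[0.9, -0.05, 0.3], [0.01, -0.7, 0.02]]
half_sparse = magnitude_prune(W, 0.5)
one_row = prune_rows(W, 1)
```

In practice a fine-tuning pass usually follows pruning to recover accuracy, which is the step the listing describes as automatic.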

2

SmolLM | Model | 59/100

via “knowledge distillation and model compression for downstream tasks”

Hugging Face's small model family for on-device use.

Unique: SmolLM's curated training data provides a high-quality teacher signal for distillation — student models distilled from SmolLM achieve better generalization than those distilled from generic large models; supports both response-based and feature-based distillation strategies

vs others: Models distilled from SmolLM 1.7B outperform models distilled from Llama 2 7B at equivalent student size due to better data quality, and distilled SmolLM students are 2-3x smaller than TinyLlama while maintaining comparable performance

3

all-MiniLM-L6-v2 | Model | 58/100

via “efficient-inference-with-model-distillation”

Sentence-similarity model. 233,518,673 downloads.

Unique: Uses asymmetric distillation where student (6 layers) learns from teacher (12 layers) via MSE loss on hidden states and attention patterns, not just final embeddings; preserves semantic structure while reducing depth, enabling both speed and quality retention

vs others: Faster inference than full BERT-base (5-10x) and smaller than full models (22.7M vs 110M params), though slower than extreme compression techniques (TinyBERT, MobileBERT) which sacrifice more quality; better quality-to-speed trade-off than quantization-only approaches
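A hidden-state distillation loss of the kind described above (a 6-layer student matched against a 12-layer teacher with MSE) might look roughly like the following sketch. The uniform layer mapping, the function names, and the assumption that student and teacher share the same hidden width are illustrative choices, not details confirmed by the listing; when widths differ, a learned projection is typically needed first.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def layer_map(num_student, num_teacher):
    """Uniform mapping: student layer i is paired with every stride-th teacher layer."""
    if num_teacher % num_student != 0:
        raise ValueError("teacher depth must be a multiple of student depth")
    stride = num_teacher // num_student
    return [(i + 1) * stride - 1 for i in range(num_student)]


def hidden_state_loss(student_states, teacher_states):
    """Average MSE between each student layer and its mapped teacher layer."""
    mapping = layer_map(len(student_states), len(teacher_states))
    losses = [mse(student_states[s], teacher_states[t]) for s, t in enumerate(mapping)]
    return sum(losses) / len(losses)


# Toy hidden states: 12 teacher layers, each a 4-dimensional vector.
teacher = [[float(i)] * 4 for i in range(12)]
# A student that reproduces the mapped teacher layers exactly has zero loss.
perfect_student = [teacher[t][:] for t in layer_map(6, 12)]
loss = hidden_state_loss(perfect_student, teacher)
```

With 6 student layers and 12 teacher layers, `layer_map` pairs the student with teacher layers 1, 3, 5, 7, 9, and 11 (zero-indexed).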

4

DeepSeek R1 | Model | 57/100

via “reasoning model distillation to smaller parameter scales”

Open-source reasoning model matching OpenAI o1.

Unique: Applies distillation to reasoning models across 6 different scales (1.5B-70B), which is rare for frontier reasoning models. Most competitors only offer single-size deployment.

vs others: Provides multiple distilled sizes enabling flexible deployment, whereas o1 only offers cloud API access at fixed capability level.

5

Llama 3.1 405B | Model | 57/100

via “model distillation and knowledge transfer to smaller models”

Meta's largest open-weight Llama model at 405B parameters.

Unique: 405B enables distillation at unprecedented scale in open source, allowing creation of smaller models that inherit 405B's capabilities through synthetic data generation and knowledge transfer, an option previously unavailable in the open-source ecosystem

vs others: Larger model scale enables higher-quality synthetic data and more effective distillation than smaller open-source models; however, inference cost for distillation is higher than proprietary distillation services

6

roberta-base | Model | 53/100

via “efficient inference via model quantization and distillation”

Fill-mask model. 19,034,963 downloads.

Unique: RoBERTa-base's 125M parameters and 12-layer architecture make it a good compression target: distilled models retain 95%+ accuracy while achieving 3-4x speedup, and INT8 quantization is particularly effective due to the model's learned robustness to weight perturbations from improved pretraining

vs others: More amenable to quantization than BERT due to improved pretraining; better compression targets than larger models (RoBERTa-large) while maintaining competitive accuracy; distilled RoBERTa variants outperform DistilBERT on most benchmarks
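To make the INT8 quantization claim concrete, here is a minimal sketch of symmetric per-tensor quantization. It assumes the simplest possible scheme (a single scale per tensor, no zero point, no calibration); the function names are hypothetical, and production toolchains generally use per-channel scales and calibration data.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


# Toy weights (hypothetical values).
weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The rounding error introduced here is the kind of "weight perturbation" the entry refers to: a model that tolerates small changes to its weights loses less accuracy when stored in 8 bits.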

7

distilbert-base-multilingual-cased-sentiments-student | Model | 49/100

via “efficient-inference-with-model-distillation”

Text-classification model. 663,335 downloads.

Unique: Combines DistilBERT's architectural compression (6 vs 12 layers) with knowledge distillation from a stronger DeBERTa-v3 teacher, achieving both size reduction and maintained accuracy. Supports ONNX export for hardware-agnostic optimization, enabling deployment across CPUs, GPUs, and specialized inference accelerators.

vs others: Smaller and faster than full multilingual BERT/DeBERTa models while maintaining better accuracy than lightweight alternatives like TinyBERT, making it ideal for production systems balancing speed, accuracy, and resource constraints.

8

nllb-200-distilled-600M | Model | 48/100

via “distilled transformer inference with knowledge transfer”

Translation model. 1,309,929 downloads.

Unique: Applies knowledge distillation specifically to the M2M-100 architecture, preserving the multilingual shared embedding space while reducing parameters by 82%. Uses logit matching and intermediate layer alignment to transfer the teacher's translation knowledge, enabling competitive performance across 200 languages with a single 600M-parameter model.

vs others: Smaller than full NLLB-200 (600M vs 3.3B) with faster inference than uncompressed models, but slower and lower quality than language-specific models fine-tuned for single pairs; trade-off is worthwhile for multilingual coverage on resource-constrained devices.
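Logit matching, as mentioned above, is commonly implemented as a temperature-softened KL divergence between teacher and student output distributions. The sketch below shows that general formulation in plain Python; the temperature value and function names are assumptions for illustration, and it is not taken from the NLLB training code.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2


# Toy logits over a 3-token vocabulary (hypothetical values).
teacher_logits = [4.0, 1.0, 0.5]
matched = distillation_loss(teacher_logits, teacher_logits)
mismatched = distillation_loss([0.5, 1.0, 4.0], teacher_logits)
```

A higher temperature flattens both distributions, so the student also learns from the relative probabilities the teacher assigns to non-top tokens.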

9

distilroberta-base | Model | 47/100

via “knowledge-distillation-from-roberta-base”

Fill-mask model. 1,073,316 downloads.

Unique: Distilled from RoBERTa-base following the DistilBERT procedure (soft-target distillation loss plus MLM loss and a cosine loss on hidden states), achieving 95-98% of teacher performance with about 34% fewer parameters (82M vs 125M), a favorable compression-accuracy tradeoff compared to training smaller models from scratch

vs others: Maintains RoBERTa's superior pretraining procedure (dynamic masking, longer training) while achieving efficiency comparable to ALBERT or MobileBERT, and outperforms BERT-base distillations due to better teacher model quality

10

nli-MiniLM2-L6-H768 | Model | 44/100

via “distilled transformer inference with reduced parameter footprint”

Zero-shot-classification model. 258,745 downloads.

Unique: Distilled from RoBERTa-Large and fine-tuned for NLI tasks, giving roughly a 4x parameter reduction (about 82M vs 355M) while maintaining >90% of teacher model accuracy on SNLI/MultiNLI benchmarks; most lightweight NLI alternatives either use non-distilled architectures or sacrifice accuracy more severely

vs others: Faster CPU inference than full-size cross-encoders (RoBERTa-Large, BERT-Large) by 3-5x; more accurate than simple bi-encoder baselines on entailment tasks due to cross-encoder architecture, despite smaller size

11

FlagEmbedding | Model | 37/100

via “knowledge distillation for model compression”

Retrieval and Retrieval-augmented LLMs

Unique: FlagEmbedding provides a retrieval-specific knowledge distillation framework that preserves embedding quality and ranking performance through teacher-student training with contrastive and ranking-aware losses.

vs others: Offers retrieval-optimized distillation compared to generic model compression, maintaining ranking quality while reducing model size.
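One widely used ranking-aware distillation objective is margin MSE, where the student is trained to reproduce the teacher's score gap between a relevant and a non-relevant passage rather than the raw scores. The sketch below illustrates that idea only; it is an assumption that a loss of this family is what the entry means, and the function name and toy scores are hypothetical rather than FlagEmbedding's API.

```python
def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """MSE between student and teacher (positive - negative) score margins."""
    errors = [
        ((sp - sn) - (tp - tn)) ** 2
        for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg)
    ]
    return sum(errors) / len(errors)


# Toy relevance scores for two queries (hypothetical values).
teacher_pos, teacher_neg = [5.0, 3.5], [4.0, 3.0]  # teacher margins: 1.0, 0.5
student_pos, student_neg = [2.0, 1.0], [1.0, 0.5]  # student margins: 1.0, 0.5
aligned = margin_mse(student_pos, student_neg, teacher_pos, teacher_neg)
# A student that cannot separate positives from negatives is penalized.
collapsed = margin_mse([1.0, 1.0], [1.0, 1.0], teacher_pos, teacher_neg)
```

Because only margins are compared, the student is free to use a different score scale from the teacher as long as the ordering and spacing of candidates are preserved.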

12

HunyuanVideo-1.5 | Model | 35/100

via “step distillation for reduced diffusion iterations”

HunyuanVideo-1.5: A leading lightweight video generation model

Unique: Uses knowledge distillation to train a student model that predicts multi-step trajectories, rather than simple output matching. The student learns to approximate the full diffusion process in fewer steps by matching the teacher's intermediate representations, not just final outputs.

vs others: Faster than DDIM or other fast samplers because it's trained specifically for few-step generation, versus generic acceleration techniques that apply to any diffusion model.

13

NVIDIA: Llama 3.3 Nemotron Super 49B V1.5 | Model | 25/100

via “inference-optimization-via-model-distillation-from-70b-to-49b”

Llama-3.3-Nemotron-Super-49B-v1.5 is a 49B-parameter, English-centric reasoning/chat model derived from Meta’s Llama-3.3-70B-Instruct with a 128K context. It’s post-trained for agentic workflows (RAG, tool calling) via SFT across math, code, science, and...

Unique: Knowledge distillation from 70B to 49B with agentic-specific post-training preserves tool-calling and RAG performance while reducing parameters by 30%, enabling faster inference than 70B without generic distillation quality loss

vs others: More efficient than running full 70B model while maintaining better reasoning than smaller models like Llama-3.1-8B, though with some capability trade-off vs full 70B

14

Amazon: Nova Premier 1.0 | Model | 24/100

via “knowledge distillation for custom model training”

Amazon Nova Premier is the most capable of Amazon’s multimodal models for complex reasoning tasks and for use as the best teacher for distilling custom models.

Unique: Amazon positions Nova Premier specifically as a distillation teacher with optimized output formats and intermediate representations designed for knowledge transfer, rather than as a general-purpose model that happens to support distillation as an afterthought

vs others: Designed from the ground up for distillation workflows with better cost-to-quality ratio than using GPT-4 or Claude as a teacher, making it more economical for teams building custom models at scale

15

AionLabs: Aion-1.0-Mini | Model | 24/100

via “knowledge distillation-based reasoning compression”

Aion-1.0-Mini 32B parameter model is a distilled version of the DeepSeek-R1 model, designed for strong performance in reasoning domains such as mathematics, coding, and logic. It is a modified variant...

Unique: Applies knowledge distillation to compress DeepSeek-R1's reasoning capability into 32B parameters, enabling reasoning-based inference at lower cost and latency than full R1

vs others: More efficient than full R1 (32B vs 671B) while retaining reasoning capability, though with unknown performance trade-offs vs. non-distilled reasoning models

16

DeepSeek: R1 Distill Qwen 32B | Model | 24/100

via “knowledge distillation-based reasoning transfer”

DeepSeek R1 Distill Qwen 32B is a distilled large language model based on [Qwen 2.5 32B](https://huggingface.co/Qwen/Qwen2.5-32B), using outputs from [DeepSeek R1](/deepseek/deepseek-r1). It outperforms OpenAI's o1-mini across various benchmarks, achieving new...

Unique: Uses knowledge distillation to transfer R1's reasoning capability to a 32B model, enabling R1-style reasoning at roughly 1/20 of R1's parameter count (32B vs 671B) through supervised fine-tuning on R1 outputs

vs others: More efficient than full R1 while maintaining reasoning quality, and more transparent than black-box reasoning models like o1 through explicit reasoning traces
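Distillation through supervised fine-tuning on teacher outputs amounts to turning each teacher response, reasoning trace included, into a training target for the student. The sketch below shows one way such a training row could be assembled; the `<think>` wrapper, the whitespace "tokenizer", and all function names are simplifying assumptions for illustration, not the actual DeepSeek data pipeline.

```python
def build_sft_example(prompt, teacher_reasoning, teacher_answer):
    """Combine a teacher's reasoning trace and final answer into one target string."""
    completion = f"<think>\n{teacher_reasoning}\n</think>\n{teacher_answer}"
    return {"prompt": prompt, "completion": completion}


def loss_mask(prompt_tokens, completion_tokens):
    """0 for prompt tokens (ignored by the loss), 1 for teacher-generated tokens."""
    return [0] * len(prompt_tokens) + [1] * len(completion_tokens)


def make_training_row(example):
    """Tokenize with a whitespace split (a stand-in for a real tokenizer) and mask."""
    prompt_tokens = example["prompt"].split()
    completion_tokens = example["completion"].split()
    return {
        "tokens": prompt_tokens + completion_tokens,
        "mask": loss_mask(prompt_tokens, completion_tokens),
    }


# Hypothetical teacher output for a toy prompt.
example = build_sft_example("What is 2 + 3?", "2 plus 3 equals 5.", "5")
row = make_training_row(example)
```

Masking the prompt means the student is only penalized on tokens the teacher produced, so it learns to generate the reasoning trace rather than to repeat the question.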

17

On Distillation of Guided Diffusion Models | Product | 23/100

via “two-stage knowledge distillation for guided diffusion models”

Unique: Specifically targets classifier-free guided diffusion by matching the guidance-weighted combined output of two teacher models (conditional + unconditional) rather than distilling single models, enabling 10-256× speedup while preserving guidance quality. Progressive distillation stages allow iterative step reduction without catastrophic quality collapse.

vs others: Achieves 10-256× faster inference than DDIM or DPM-Solver by distilling the guidance mechanism itself rather than just optimizing sampling schedules, but requires access to original training data and pre-trained models unlike general-purpose acceleration methods.
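The guidance-weighted combined output that the student learns to match can be written as (1 + w) times the conditional prediction minus w times the unconditional prediction, where w is the guidance strength. The sketch below shows that target and a simple regression loss against it; the function names and toy vectors are assumptions for illustration, and real implementations operate on image tensors and sample w over a range.

```python
def guided_output(cond, uncond, w):
    """Classifier-free guidance: (1 + w) * conditional - w * unconditional."""
    return [(1.0 + w) * c - w * u for c, u in zip(cond, uncond)]


def stage_one_loss(student_pred, cond, uncond, w):
    """MSE between a single student prediction and the two-pass guided target."""
    target = guided_output(cond, uncond, w)
    return sum((s - t) ** 2 for s, t in zip(student_pred, target)) / len(target)


# Toy teacher predictions (hypothetical values).
cond, uncond = [1.0, 2.0], [0.5, 1.0]
no_guidance = guided_output(cond, uncond, 0.0)  # w = 0 recovers the conditional output
strong = guided_output(cond, uncond, 2.0)
loss = stage_one_loss(strong, cond, uncond, 2.0)
```

Once the student reproduces this target in one forward pass, it halves the per-step cost compared with evaluating the conditional and unconditional predictions separately, before any step-count reduction is applied.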

18

OPT | Model | 22/100

via “model distillation and compression for deployment”

Open Pretrained Transformers (OPT) by Facebook is a suite of decoder-only pre-trained transformers. [Announcement](https://ai.meta.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/).

19

Scaling Vision Transformers to 22 Billion Parameters (ViT 22B) | Product | 22/100

via “efficient inference with knowledge distillation from teacher models”

Unique: Combines multiple distillation strategies (response, feature, and relation-based) in a unified framework, enabling flexible compression where different layers can use different distillation targets. Uses attention pattern matching to preserve model interpretability while compressing.

vs others: Achieves 92-95% of teacher accuracy at 20% model size, compared to 85-90% for standard response-based distillation alone. Enables deployment of 1-2B parameter models with near-teacher performance, whereas pruning or quantization alone typically requires 30-40% accuracy sacrifice at equivalent compression ratios.

20

Build a DeepSeek Model (From Scratch) | Product | 18/100

via “model distillation and knowledge transfer techniques”

A book about implementing DeepSeek-style LLM architecture, training, and distillation methods.

Unique: Focuses on distillation techniques specifically adapted for DeepSeek architectures rather than generic distillation tutorials; likely covers distillation patterns for DeepSeek's specific architectural features (e.g., distilling mixture-of-experts models, handling attention pattern transfer, preserving reasoning capabilities in student models)

vs others: More targeted than general distillation resources because it addresses the specific challenges of compressing DeepSeek-style models while maintaining their distinctive capabilities, rather than applying generic distillation to arbitrary architectures
