Knowledge Distillation For Model Compression

1

DeepSpeed · Framework · 60/100

via “model compression through pruning and distillation”

Microsoft's distributed training library — ZeRO optimizer, trillion-parameter scale, RLHF.

Unique: Combines structured pruning with knowledge distillation; supports both unstructured and structured sparsity patterns with automatic fine-tuning to recover accuracy

vs others: More integrated than separate pruning/distillation tools; automatic fine-tuning reduces manual tuning effort

2

TensorFlow Lite · Framework · 60/100

via “model size reduction via structured pruning and sparsity”

Lightweight ML inference for mobile and edge devices.

Unique: Structured pruning removes entire filters/channels (not individual weights) to maintain hardware efficiency and avoid sparse tensor overhead. Uses magnitude-based or gradient-based importance scoring to identify prunable structures, then applies iterative fine-tuning to recover accuracy. Integrates with quantization pipeline for cumulative compression.

vs others: More hardware-efficient than unstructured pruning (which requires sparse tensor libraries) and more effective than simple weight decay regularization. Unlike post-training quantization, it requires fine-tuning, and its typical 30-50% size reduction is smaller than the roughly 4x reduction from float32-to-int8 quantization alone, so the two are best combined.
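The magnitude-based scoring described in this entry can be sketched in a few lines. This is a minimal illustration, not TensorFlow Lite's implementation: the helper names (`l1_norm`, `prune_filters`) and the `keep_ratio` parameter are hypothetical, filters are represented as flat lists of weights, and the fine-tuning step that follows pruning is omitted.

```python
def l1_norm(filter_weights):
    """Importance score: sum of absolute weights in one filter."""
    return sum(abs(w) for w in filter_weights)


def prune_filters(filters, keep_ratio):
    """Keep the highest-scoring filters and drop the rest as whole units.

    Returns the surviving filters and their original indices, so the
    next layer's input channels can be sliced to match.
    """
    n_keep = max(1, round(len(filters) * keep_ratio))
    ranked = sorted(range(len(filters)),
                    key=lambda i: l1_norm(filters[i]),
                    reverse=True)
    kept = sorted(ranked[:n_keep])  # preserve original channel order
    return [filters[i] for i in kept], kept


# Four toy filters; two carry almost no weight magnitude.
filters = [
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.01],
    [-0.5, 0.4, 0.6],
    [0.03, -0.02, 0.0],
]
pruned, kept_indices = prune_filters(filters, keep_ratio=0.5)
print(kept_indices)  # [0, 2]
```

Because whole filters are removed, the result is a smaller dense tensor rather than a sparse one, which is the hardware-efficiency argument made above.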

3

SmolLM · Model · 59/100

via “knowledge distillation and model compression for downstream tasks”

Hugging Face's small model family for on-device use.

Unique: SmolLM's curated training data provides a high-quality teacher signal for distillation — student models distilled from SmolLM achieve better generalization than those distilled from generic large models; supports both response-based and feature-based distillation strategies

vs others: Models distilled from SmolLM 1.7B outperform models distilled from Llama 2 7B at equivalent student size due to better data quality, and distilled SmolLM students are 2-3x smaller than TinyLlama while maintaining comparable performance

4

all-MiniLM-L6-v2 · Model · 58/100

via “efficient-inference-with-model-distillation”

sentence-similarity model. 233,518,673 downloads.

Unique: Uses asymmetric distillation where student (6 layers) learns from teacher (12 layers) via MSE loss on hidden states and attention patterns, not just final embeddings; preserves semantic structure while reducing depth, enabling both speed and quality retention

vs others: Faster inference than full BERT-base (5-10x) and smaller than full models (22.7M vs 110M params), though slower than extreme compression techniques (TinyBERT, MobileBERT) which sacrifice more quality; better quality-to-speed trade-off than quantization-only approaches
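A hidden-state matching loss of the kind this entry describes might look like the sketch below. It assumes a uniform layer mapping (each of the 6 student layers is paired with every second teacher layer) and equal hidden sizes for teacher and student; both are simplifying assumptions, and the function names are hypothetical. A real setup with different hidden sizes would need a learned projection first.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def layer_map(num_student_layers, num_teacher_layers):
    """Pair student layer i with the last teacher layer of its block.

    For 6 student and 12 teacher layers this yields
    (0, 1), (1, 3), (2, 5), (3, 7), (4, 9), (5, 11).
    """
    step = num_teacher_layers // num_student_layers
    return [(i, (i + 1) * step - 1) for i in range(num_student_layers)]


def hidden_state_loss(student_states, teacher_states):
    """Average MSE over all mapped (student, teacher) layer pairs."""
    pairs = layer_map(len(student_states), len(teacher_states))
    total = sum(mse(student_states[s], teacher_states[t]) for s, t in pairs)
    return total / len(pairs)


# Toy hidden states: one 4-dimensional vector per layer.
teacher = [[float(layer)] * 4 for layer in range(12)]
student = [list(teacher[t]) for _, t in layer_map(6, 12)]
perfect = hidden_state_loss(student, teacher)

# Shift every student activation by 1.0 to see the loss respond.
shifted = [[v + 1.0 for v in state] for state in student]
off_by_one = hidden_state_loss(shifted, teacher)
```

In training, this term would be added to a task or output-matching loss and minimized with respect to the student's parameters only.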

5

Llama 3.1 405B · Model · 57/100

via “model distillation and knowledge transfer to smaller models”

Meta's largest open-weight Llama 3.1 model, at 405B parameters.

Unique: 405B enables distillation at unprecedented scale in open source, allowing creation of smaller models that inherit 405B's capabilities through synthetic data generation and knowledge transfer, previously unavailable in open-source ecosystem

vs others: Larger model scale enables higher-quality synthetic data and more effective distillation than smaller open-source models; however, inference cost for distillation is higher than proprietary distillation services

6

gpt2 · Model · 56/100

text-generation model. 16,037,172 downloads.

Unique: Enables knowledge transfer from larger teacher (GPT-2) to smaller student via soft target matching, preserving linguistic knowledge while reducing parameters — complementary to quantization for extreme compression

vs others: More effective than quantization alone for large compression ratios (5-10x), but requires training vs quantization's post-hoc approach — best combined with quantization for maximum compression
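Soft target matching is usually implemented as a KL divergence between temperature-softened teacher and student distributions. The sketch below uses only the standard library; the function names and the default temperature of 2.0 are illustrative assumptions, and the `temperature ** 2` scaling follows the common convention for keeping gradient magnitudes comparable across temperatures.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # soft targets
    q = softmax(student_logits, temperature)  # student prediction
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher_logits = [4.0, 1.0, 0.5]
matching = kd_loss(teacher_logits, teacher_logits)    # student agrees
mismatched = kd_loss([0.5, 1.0, 4.0], teacher_logits)  # student disagrees
```

A higher temperature spreads probability mass over the non-argmax classes, which is where the teacher's "dark knowledge" about class similarity lives. In practice this loss is typically mixed with the ordinary cross-entropy on ground-truth labels.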

7

llmcompressor · Repository · 56/100

via “large language model compression toolkit”

Toolkit for LLM quantization, pruning, and distillation.

Unique: llmcompressor uniquely bridges research-grade compression algorithms with production-ready inference engines, making it accessible for practical deployment.

vs others: Unlike other compression tools, llmcompressor is specifically designed for seamless integration with vLLM and Hugging Face, enhancing its usability for developers.

8

sentence-transformers · Repository · 56/100

via “model-quantization-and-optimization-for-inference”

Framework for sentence embeddings and semantic search.

Unique: unknown — insufficient data on quantization implementation details and supported techniques

vs others: unknown — insufficient data to compare quantization approach against alternatives

9

ai-notes · Repository · 49/100

via “small models and efficient ai tracking”

Notes for software engineers getting up to speed on new AI developments. Serves as a datastore for https://latent.space writing and product brainstorming, with cleaned-up canonical references under the /Resources folder.

Unique: Tracks the full spectrum of model efficiency techniques (quantization, distillation, pruning, architecture search) and their impact on model capabilities, rather than treating efficiency as a single dimension

vs others: More comprehensive than individual model documentation because it covers the landscape of efficient models, but less detailed than specialized optimization frameworks

10

nllb-200-distilled-600M · Model · 48/100

via “distilled transformer inference with knowledge transfer”

translation model. 1,309,929 downloads.

Unique: Applies knowledge distillation specifically to the M2M-100 architecture, preserving the multilingual shared embedding space while reducing parameters by 82%. Uses logit matching and intermediate layer alignment to transfer the teacher's translation knowledge, enabling competitive performance across 200 languages with a single 600M-parameter model.

vs others: Smaller than full NLLB-200 (600M vs 3.3B) with faster inference than uncompressed models, but slower and lower quality than language-specific models fine-tuned for single pairs; trade-off is worthwhile for multilingual coverage on resource-constrained devices.
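The size figures quoted in this entry can be checked with simple arithmetic. The helper names below are hypothetical, and the inputs are the nominal parameter counts from the listing (600M and 3.3B), not exact checkpoint sizes.

```python
def reduction_percent(teacher_params, student_params):
    """Percentage of parameters removed going from teacher to student."""
    return (1 - student_params / teacher_params) * 100


def compression_ratio(teacher_params, student_params):
    """How many times smaller the student is than the teacher."""
    return teacher_params / student_params


nllb_reduction = reduction_percent(3.3e9, 600e6)
nllb_ratio = compression_ratio(3.3e9, 600e6)
print(f"{nllb_reduction:.0f}% fewer parameters, {nllb_ratio:.1f}x smaller")
# 82% fewer parameters, 5.5x smaller
```

The same two functions make it easy to compare entries that quote a percentage reduction against those that quote an "Nx" ratio: a 50% reduction is only 2x, while 4x corresponds to a 75% reduction.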

11

distilroberta-base · Model · 47/100

via “knowledge-distillation-from-roberta-base”

fill-mask model. 1,073,316 downloads.

Unique: Distilled from RoBERTa-base using standard knowledge distillation (MSE loss on hidden states + MLM loss), reaching a reported 95-98% of teacher performance with about 34% fewer parameters (82M vs. 125M), a favorable compression-accuracy tradeoff compared to training smaller models from scratch

vs others: Maintains RoBERTa's superior pretraining procedure (dynamic masking, longer training) while achieving efficiency comparable to ALBERT or MobileBERT, and outperforms BERT-base distillations due to better teacher model quality

12

opus-mt-zh-en · Model · 44/100

via “inference optimization via model quantization and pruning support”

translation model. 221,448 downloads.

Unique: The Marian architecture's encoder-decoder simplicity (no custom ops, standard Transformer layers) makes it highly amenable to post-training quantization without custom kernel implementations. Unlike larger models requiring specialized quantization schemes, opus-mt-zh-en can be quantized using standard PyTorch quantization APIs (torch.quantization.quantize_dynamic) with minimal code changes.

vs others: More quantization-friendly than complex models with custom operations. Because the base model is already relatively small (~300M parameters), distillation has less room to help, so quantizing it directly gives a better quality/latency tradeoff than distilling it further
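Calling `torch.quantization.quantize_dynamic` requires PyTorch, so the sketch below instead shows, with the standard library only, the arithmetic that int8 weight quantization performs. Symmetric per-tensor scaling is an assumption chosen for simplicity and is not a description of PyTorch's internals; the function names are hypothetical.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in quantized]


weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight drops from 32 bits to 8, which is where the roughly 4x size reduction of int8 quantization comes from. No retraining is involved, which is why quantization is described as post-hoc in contrast to distillation.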

13

mobilebert-uncased-squad-v2 · Model · 39/100

via “knowledge distillation-based model compression for transfer learning”

question-answering model. 32,657 downloads.

Unique: MobileBERT uses inverted bottleneck architecture (wide intermediate layers, narrow hidden states) combined with intermediate layer distillation, achieving superior compression compared to simple pruning or quantization. This architectural design is inherently distillation-friendly, enabling efficient knowledge transfer.

vs others: More effective knowledge transfer than DistilBERT (which uses only final layer distillation) due to intermediate layer matching; enables fine-tuning on custom datasets with better accuracy retention than training smaller models from scratch.

14

FlagEmbedding · Model · 37/100

Retrieval and Retrieval-augmented LLMs

Unique: FlagEmbedding provides retrieval-specific knowledge distillation framework that preserves embedding quality and ranking performance through teacher-student training with contrastive and ranking-aware losses.

vs others: Offers retrieval-optimized distillation compared to generic model compression, maintaining ranking quality while reducing model size.

15

NVIDIA: Llama 3.3 Nemotron Super 49B V1.5 · Model · 25/100

via “inference-optimization-via-model-distillation-from-70b-to-49b”

Llama-3.3-Nemotron-Super-49B-v1.5 is a 49B-parameter, English-centric reasoning/chat model derived from Meta’s Llama-3.3-70B-Instruct with a 128K context. It’s post-trained for agentic workflows (RAG, tool calling) via SFT across math, code, science, and...

Unique: Knowledge distillation from 70B to 49B with agentic-specific post-training preserves tool-calling and RAG performance while reducing parameters by 30%, enabling faster inference than 70B without generic distillation quality loss

vs others: More efficient than running full 70B model while maintaining better reasoning than smaller models like Llama-3.1-8B, though with some capability trade-off vs full 70B

16

AionLabs: Aion-1.0-Mini · Model · 24/100

via “knowledge distillation-based reasoning compression”

Aion-1.0-Mini 32B parameter model is a distilled version of the DeepSeek-R1 model, designed for strong performance in reasoning domains such as mathematics, coding, and logic. It is a modified variant...

Unique: Applies knowledge distillation to compress DeepSeek-R1's reasoning capability into 32B parameters, enabling reasoning-based inference at lower cost and latency than full R1

vs others: More efficient than full R1 (32B vs 671B) while retaining reasoning capability, though with unknown performance trade-offs vs. non-distilled reasoning models

17

Amazon: Nova Premier 1.0 · Model · 24/100

via “knowledge distillation for custom model training”

Amazon Nova Premier is the most capable of Amazon’s multimodal models for complex reasoning tasks and for use as the best teacher for distilling custom models.

Unique: Amazon positions Nova Premier specifically as a distillation teacher with optimized output formats and intermediate representations designed for knowledge transfer, rather than as a general-purpose model that happens to support distillation as an afterthought

vs others: Designed from the ground up for distillation workflows with better cost-to-quality ratio than using GPT-4 or Claude as a teacher, making it more economical for teams building custom models at scale

18

On Distillation of Guided Diffusion Models · Product · 23/100

via “two-stage knowledge distillation for guided diffusion models”

Unique: Specifically targets classifier-free guided diffusion by matching the guidance-weighted combined output of two teacher models (conditional + unconditional) rather than distilling single models, enabling 10-256× speedup while preserving guidance quality. Progressive distillation stages allow iterative step reduction without catastrophic quality collapse.

vs others: Achieves 10-256× faster inference than DDIM or DPM-Solver by distilling the guidance mechanism itself rather than just optimizing sampling schedules, but requires access to original training data and pre-trained models unlike general-purpose acceleration methods.
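The "guidance-weighted combined output" that the student learns to match can be written out directly. The sketch assumes the common parameterization in which guidance strength `w = 0` recovers the plain conditional prediction; the function names are hypothetical, predictions are flat lists of floats, and only the first (guidance-matching) stage is shown, not the later step-reduction stage.

```python
def guided_target(cond_pred, uncond_pred, w):
    """Combine teacher predictions: (1 + w) * conditional - w * unconditional."""
    return [(1 + w) * c - w * u for c, u in zip(cond_pred, uncond_pred)]


def stage_one_loss(student_pred, cond_pred, uncond_pred, w):
    """Squared error between one student pass and the combined teacher output."""
    target = guided_target(cond_pred, uncond_pred, w)
    return sum((s - t) ** 2 for s, t in zip(student_pred, target)) / len(target)


cond = [1.0, 2.0]
uncond = [0.0, 1.0]

no_guidance = guided_target(cond, uncond, w=0.0)      # equals cond
strong_guidance = guided_target(cond, uncond, w=2.0)  # pushed away from uncond

loss_when_matched = stage_one_loss(strong_guidance, cond, uncond, w=2.0)
```

The teacher needs two network evaluations per sampling step to produce this target, while the student produces it in one. That halving of cost per step comes before any reduction in the number of steps.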

19

OPT · Model · 22/100

via “model distillation and compression for deployment”

Open Pretrained Transformers (OPT) by Facebook is a suite of decoder-only pre-trained transformers. [Announcement](https://ai.meta.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/).

20

Scaling Vision Transformers to 22 Billion Parameters (ViT 22B) · Product · 22/100

via “efficient inference with knowledge distillation from teacher models”

Unique: Combines multiple distillation strategies (response, feature, and relation-based) in a unified framework, enabling flexible compression where different layers can use different distillation targets. Uses attention pattern matching to preserve model interpretability while compressing.

vs others: Achieves 92-95% of teacher accuracy at 20% model size, compared to 85-90% for standard response-based distillation alone. Enables deployment of 1-2B parameter models with near-teacher performance, whereas pruning or quantization alone typically requires 30-40% accuracy sacrifice at equivalent compression ratios.
