Fast Inference With Distilled Model Architecture

1

Stable Diffusion 3.5 Large (Model, 59/100)

via “fast image generation with distilled diffusion steps”

Stability AI's 8B parameter flagship image generation model.

Unique: Applies knowledge distillation to cut sampling from the standard schedule to 4 steps while keeping the full 8.1B-parameter model, so inference gets faster without architectural changes or a separately trained lightweight model.

vs others: Faster than standard Stable Diffusion 3.5 Large at the same parameter count, but slower than purpose-built fast models such as LCM-LoRA or consistency models; trades speed for quality more conservatively than extreme distillation approaches.

2

all-MiniLM-L6-v2 (Model, 58/100)

via “efficient-inference-with-model-distillation”

Sentence-similarity model. 233,518,673 downloads.

Unique: Uses asymmetric distillation in which a 6-layer student learns from a 12-layer teacher via MSE loss on hidden states and attention patterns, not just final embeddings. This preserves semantic structure while reducing depth, so the model gains speed and retains quality.

vs others: Faster inference than full BERT-base (5-10x) and smaller than full models (22.7M vs 110M params), though slower than extreme compression techniques (TinyBERT, MobileBERT) which sacrifice more quality; better quality-to-speed trade-off than quantization-only approaches
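
As an illustration of the hidden-state matching described above, here is a minimal pure-Python sketch of a layer-wise MSE distillation loss. It is not the training code for all-MiniLM-L6-v2: the function names (`layer_map`, `hidden_state_loss`) and the evenly spaced layer mapping are assumptions, and the sketch assumes student and teacher vectors already share a dimension (real setups with different widths would need a learned projection first).

```python
# Toy sketch of layer-wise hidden-state distillation (hypothetical helper names).
# Assumption: student and teacher hidden vectors have the same length.

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def layer_map(num_student_layers, num_teacher_layers):
    """Map each student layer to an evenly spaced teacher layer (0-indexed)."""
    stride = num_teacher_layers // num_student_layers
    return [(i + 1) * stride - 1 for i in range(num_student_layers)]

def hidden_state_loss(student_states, teacher_states):
    """Average MSE between each student layer and its mapped teacher layer."""
    mapping = layer_map(len(student_states), len(teacher_states))
    losses = [mse(s, teacher_states[t]) for s, t in zip(student_states, mapping)]
    return sum(losses) / len(losses)
```

With 6 student layers and 12 teacher layers, `layer_map(6, 12)` pairs the student layers with every second teacher layer, which is one common choice among several.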

3

DeepSeek R1 (Model, 57/100)

via “reasoning model distillation to smaller parameter scales”

Open-source reasoning model matching OpenAI o1.

Unique: Applies distillation to reasoning models across 6 different scales (1.5B-70B), which is rare for frontier reasoning models. Most competitors only offer single-size deployment.

vs others: Provides multiple distilled sizes enabling flexible deployment, whereas o1 only offers cloud API access at fixed capability level.

4

Llama 3.1 405B (Model, 57/100)

via “model distillation and knowledge transfer to smaller models”

Largest open-weight model at 405B parameters.

Unique: At 405B parameters, enables distillation at a scale previously unavailable in the open-source ecosystem: smaller models can inherit the 405B model's capabilities through synthetic data generation and knowledge transfer.

vs others: Larger model scale enables higher-quality synthetic data and more effective distillation than smaller open-source models; however, inference cost for distillation is higher than proprietary distillation services

5

distilbert-base-multilingual-cased-sentiments-student (Model, 49/100)

via “efficient-inference-with-model-distillation”

Text-classification model. 663,335 downloads.

Unique: Combines DistilBERT's architectural compression (6 vs 12 layers) with knowledge distillation from a stronger DeBERTa-v3 teacher, reducing size while maintaining accuracy. Supports ONNX export for hardware-agnostic optimization, enabling deployment across CPUs, GPUs, and specialized inference accelerators.

vs others: Smaller and faster than full multilingual BERT/DeBERTa models while maintaining better accuracy than lightweight alternatives like TinyBERT, making it ideal for production systems balancing speed, accuracy, and resource constraints.

6

nllb-200-distilled-600M (Model, 48/100)

via “distilled transformer inference with knowledge transfer”

Translation model. 1,309,929 downloads.

Unique: Applies knowledge distillation specifically to the M2M-100 architecture, preserving the multilingual shared embedding space while reducing parameters by 82%. Uses logit matching and intermediate layer alignment to transfer the teacher's translation knowledge, enabling competitive performance across 200 languages with a single 600M-parameter model.

vs others: Smaller than full NLLB-200 (600M vs 3.3B) with faster inference than uncompressed models, but slower and lower quality than language-specific models fine-tuned for single pairs; trade-off is worthwhile for multilingual coverage on resource-constrained devices.
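
Logit matching, as mentioned above, is commonly implemented as a KL divergence between temperature-softened teacher and student distributions. The sketch below shows that generic formulation in pure Python; it is an assumption about the general technique, not NLLB's actual training objective, and the function names and the default temperature of 2.0 are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def logit_matching_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)  # teacher (target) distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2
```

The loss is zero when student and teacher logits agree and grows as the student's distribution drifts from the teacher's.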

7

nli-MiniLM2-L6-H768 (Model, 44/100)

via “distilled transformer inference with reduced parameter footprint”

Zero-shot-classification model. 258,745 downloads.

Unique: Distilled from RoBERTa-Large specifically for NLI tasks, achieving a 15x parameter reduction while maintaining >90% of teacher model accuracy on SNLI/MultiNLI benchmarks. Most lightweight NLI alternatives either use non-distilled architectures or sacrifice accuracy more severely.

vs others: Faster CPU inference than full-size cross-encoders (RoBERTa-Large, BERT-Large) by 3-5x; more accurate than simple bi-encoder baselines on entailment tasks due to cross-encoder architecture, despite smaller size

8

HunyuanVideo-1.5 (Model, 35/100)

via “step distillation for reduced diffusion iterations”

HunyuanVideo-1.5: A leading lightweight video generation model

Unique: Uses knowledge distillation to train a student model that predicts multi-step trajectories, rather than simple output matching. The student learns to approximate the full diffusion process in fewer steps by matching the teacher's intermediate representations, not just final outputs.

vs others: Faster than DDIM or other fast samplers because it's trained specifically for few-step generation, versus generic acceleration techniques that apply to any diffusion model.
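
The general idea of training a student for few-step generation can be sketched with a toy example in the style of progressive distillation: one student step is fitted to reproduce two teacher steps. This is not HunyuanVideo's method; the toy ODE, the linear student, and all function names below are assumptions chosen so the example runs in pure Python.

```python
def teacher_step(x, dt):
    """One Euler step of the toy ODE dx/dt = -x (stands in for a denoising step)."""
    return x - dt * x

def two_teacher_steps(x, dt):
    """Distillation target: two teacher half-steps covering the interval dt."""
    half = dt / 2
    return teacher_step(teacher_step(x, half), half)

def train_student(samples, dt, lr=0.1, epochs=200):
    """Fit student(x) = a * x so that one student step matches two teacher steps."""
    a = 1.0  # student parameter, initialised to the identity map
    for _ in range(epochs):
        grad = 0.0
        for x in samples:
            err = a * x - two_teacher_steps(x, dt)
            grad += 2 * err * x  # derivative of squared error w.r.t. a
        a -= lr * grad / len(samples)
    return a

# The trained student covers the interval in one step instead of two.
a = train_student([-2.0, -1.0, 0.5, 1.0, 2.0], dt=0.5)
```

Repeating this halving several times is what takes a sampler from many steps down to a handful, whereas a generic fast sampler such as DDIM changes only the solver and leaves the model untrained for the shorter schedule.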

9

NVIDIA: Llama 3.3 Nemotron Super 49B V1.5 (Model, 25/100)

via “inference-optimization-via-model-distillation-from-70b-to-49b”

Llama-3.3-Nemotron-Super-49B-v1.5 is a 49B-parameter, English-centric reasoning/chat model derived from Meta’s Llama-3.3-70B-Instruct with a 128K context. It’s post-trained for agentic workflows (RAG, tool calling) via SFT across math, code, science, and...

Unique: Knowledge distillation from 70B to 49B, combined with agentic-specific post-training, preserves tool-calling and RAG performance while reducing parameters by 30%. This enables faster inference than the 70B model while avoiding the quality loss typical of generic distillation.

vs others: More efficient than running full 70B model while maintaining better reasoning than smaller models like Llama-3.1-8B, though with some capability trade-off vs full 70B
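
The parameter reductions quoted in this listing can be checked with a line of arithmetic. The helper below is a hypothetical convenience function, and the parameter counts are the rounded figures given in the entries above.

```python
def reduction_percent(teacher_params, student_params):
    """Percentage of parameters removed when going from teacher to student."""
    return 100 * (1 - student_params / teacher_params)

# Llama-3.3-70B-Instruct -> Nemotron Super 49B
print(round(reduction_percent(70e9, 49e9)))    # 30
# NLLB-200 3.3B -> nllb-200-distilled-600M
print(round(reduction_percent(3.3e9, 600e6)))  # 82
```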

10

Google: Gemma 4 31B (free) (Model, 25/100)

via “dense transformer architecture with efficient inference”

Gemma 4 31B Instruct is Google DeepMind's 30.7B dense multimodal model supporting text and image input with text output. Features a 256K token context window, configurable thinking/reasoning mode, native function...

Unique: Dense 30.7B architecture (vs sparse MoE alternatives) with optimized inference kernels for predictable latency and memory usage, avoiding the routing overhead and variance of mixture-of-experts models

vs others: More predictable than Mixtral 8x7B (sparse MoE) due to no routing variance; more efficient than Llama 70B due to smaller parameter count while maintaining comparable capability

11

On Distillation of Guided Diffusion Models (Product, 25/100)

via “latent-space diffusion model distillation”

* ⭐ 10/2022: [LAION-5B: An open large-scale dataset for training next generation image-text models (LAION-5B)](https://arxiv.org/abs/2210.08402)

Unique: Achieves 10-256× speedup on latent-space models by distilling guidance mechanisms within VAE latent space, enabling 1-4 step generation on high-resolution datasets. Leverages VAE compression to reduce computational cost compared to pixel-space distillation.

vs others: 10-256× faster inference than standard Stable Diffusion or DALL-E 2, but requires distillation preprocessing and may sacrifice perceptual quality at extreme step reduction (1 step) compared to non-distilled models.
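
To make "distilling guidance" concrete: classifier-free guidance normally needs two network evaluations per step (conditional and unconditional), and a guidance-distilled student is trained to reproduce the combined result in one. The sketch below is a simplified illustration under that assumption, using one common convention for the guidance scale `w`; the function names are hypothetical and the step counts in the usage lines are example values, not figures from the paper.

```python
def guided_prediction(eps_cond, eps_uncond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]

def guidance_distillation_loss(student_eps, eps_cond, eps_uncond, w):
    """MSE between a single student prediction and the guided teacher target."""
    target = guided_prediction(eps_cond, eps_uncond, w)
    return sum((s - t) ** 2 for s, t in zip(student_eps, target)) / len(target)

def network_evals(steps, evals_per_step):
    """Total forward passes needed to generate one sample."""
    return steps * evals_per_step

# Example values only: a 50-step guided teacher vs a 4-step distilled student.
teacher_cost = network_evals(50, 2)  # two passes per step for guidance
student_cost = network_evals(4, 1)   # guidance folded into one pass
speedup = teacher_cost / student_cost
```

Folding guidance into the student halves the per-step cost, and step distillation then reduces the step count; the two savings multiply.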

12

Scaling Vision Transformers to 22 Billion Parameters (ViT 22B) (Product, 24/100)

via “efficient inference with knowledge distillation from teacher models”

* ⭐ 02/2023: [Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet)](https://arxiv.org/abs/2302.05543)

Unique: Combines multiple distillation strategies (response, feature, and relation-based) in a unified framework, enabling flexible compression where different layers can use different distillation targets. Uses attention pattern matching to preserve model interpretability while compressing.

vs others: Achieves 92-95% of teacher accuracy at 20% model size, compared to 85-90% for standard response-based distillation alone. Enables deployment of 1-2B parameter models with near-teacher performance, whereas pruning or quantization alone typically requires 30-40% accuracy sacrifice at equivalent compression ratios.
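
The three distillation families named above can be combined as a weighted sum of losses. The following is a minimal pure-Python sketch of that idea, not the framework described in the entry: the function names, the use of MSE for every term, and the default weights are all assumptions made for illustration.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def pairwise_similarities(batch):
    """Flattened matrix of dot products between every pair of samples."""
    return [sum(x * y for x, y in zip(a, b)) for a in batch for b in batch]

def combined_distillation_loss(s_logits, t_logits, s_feats, t_feats,
                               weights=(1.0, 1.0, 1.0)):
    """Weighted sum of response-, feature-, and relation-based terms.

    Each argument is a list with one vector per sample in the batch.
    """
    n = len(s_logits)
    # Response-based: match the teacher's outputs.
    response = sum(mse(s, t) for s, t in zip(s_logits, t_logits)) / n
    # Feature-based: match intermediate representations.
    feature = sum(mse(s, t) for s, t in zip(s_feats, t_feats)) / n
    # Relation-based: match how samples relate to one another.
    relation = mse(pairwise_similarities(s_feats), pairwise_similarities(t_feats))
    w_resp, w_feat, w_rel = weights
    return w_resp * response + w_feat * feature + w_rel * relation
```

Setting a weight to zero disables that term, which is one simple way to let different training stages or layers use different distillation targets.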

13

FLUX.1-schnell (Model, 21/100)

FLUX.1-schnell — AI demo on HuggingFace
