Flash Attention 2 Integration for Sub-Quadratic Attention Computation

1

transformers · Framework · 65/100

via “attention mechanism implementations with optimization variants”

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

Unique: Implements an attention dispatch system (src/transformers/models/*/modeling_*.py) that automatically selects the fastest attention variant (flash attention, memory-efficient attention, standard attention) based on hardware capabilities and input shapes without requiring model code changes

vs others: More efficient than standard PyTorch attention because it automatically selects optimized implementations (flash attention, memory-efficient variants) based on hardware, reducing inference latency by 2-4x without model modifications
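The dispatch idea described above can be sketched as a small selection function. This is a minimal illustration, not the transformers implementation: `Device`, `select_attention_backend`, the backend names, and every threshold are hypothetical and chosen only to show the shape of a capability-based fallback chain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical hardware description (not a transformers type)."""
    kind: str                # "cuda", "cpu", ...
    compute_capability: int  # e.g. 80 for Ampere; 0 when not applicable
    has_flash_attn: bool     # whether a flash-attention package is importable


def select_attention_backend(device: Device, head_dim: int, dtype: str) -> str:
    """Return the first backend whose constraints are satisfied.

    The constraints are illustrative assumptions, not library rules.
    """
    flash_ok = (
        device.kind == "cuda"
        and device.has_flash_attn
        and device.compute_capability >= 80
        and dtype in ("float16", "bfloat16")
        and head_dim <= 256
    )
    if flash_ok:
        return "flash_attention"
    if device.kind == "cuda":
        # Fall back to a fused kernel with looser requirements.
        return "memory_efficient"
    # Portable path that works everywhere.
    return "standard"


if __name__ == "__main__":
    print(select_attention_backend(Device("cuda", 80, True), 128, "bfloat16"))
    print(select_attention_backend(Device("cpu", 0, False), 64, "float32"))
```

In transformers itself, a specific implementation can also be requested explicitly through the `attn_implementation` argument of `from_pretrained` (for example `"sdpa"` or `"flash_attention_2"`), which is useful when the default choice is not the one wanted.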

2

AutoAWQ · Repository · 59/100

via “fused attention and transformer block optimization”

4-bit weight quantization for LLMs on consumer GPUs.

Unique: Implements model-specific fused attention blocks that combine QKV projection, attention computation, and output projection into single kernels, rather than using generic PyTorch operations. This approach reduces kernel launch overhead and enables memory layout optimizations that are impossible with modular code.

vs others: More aggressive fusion than FlashAttention (which fuses attention only); comparable to vLLM's paged attention but with simpler memory management since AutoAWQ doesn't implement paging.
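One piece of the fusion described above, the combined QKV projection, can be illustrated without any GPU code. The sketch below only shows that stacking the three weight matrices lets a single matrix-vector product replace three; it does not reproduce AutoAWQ's kernels, and all function names are hypothetical.

```python
def matvec(weight, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def separate_projections(x, wq, wk, wv):
    """Three independent projections: three passes over the input."""
    return matvec(wq, x), matvec(wk, x), matvec(wv, x)


def stack_qkv(wq, wk, wv):
    """Concatenate the weight rows once, ahead of time."""
    return wq + wk + wv


def fused_projection(x, w_qkv, dim):
    """One pass over the input, then slice the result into q, k, v."""
    y = matvec(w_qkv, x)
    return y[:dim], y[dim:2 * dim], y[2 * dim:]


if __name__ == "__main__":
    wq = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]
    wk = [[0.0, 1.0, 1.0], [2.0, 0.0, -1.0]]
    wv = [[1.0, 1.0, 1.0], [0.0, 0.5, 0.5]]
    x = [1.0, 2.0, 3.0]
    print(fused_projection(x, stack_qkv(wq, wk, wv), dim=2))
```

On a GPU the benefit comes from launching one kernel and reading the input once, which this CPU sketch cannot measure; it only checks that the fused form computes the same values.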

3

ExLlamaV2 · Repository · 58/100

via “flash attention 2 integration for sub-quadratic attention computation”

Optimized quantized LLM inference for consumer GPUs — EXL2/GPTQ, flash attention, memory-efficient.

Unique: Directly integrates the Flash Attention 2 CUDA kernels (Dao, 2023), which fuse the QK^T computation, softmax, and value multiplication into a single kernel with block-wise tiling. This avoids materializing the full N×N attention matrix, so the extra memory for attention grows linearly rather than quadratically with sequence length, and it cuts reads and writes to GPU memory by up to 10x compared to standard attention.

vs others: Achieves 2-3x faster attention computation and up to 10x lower attention memory usage than standard PyTorch attention. Flash Attention 2 fuses the operations into a single kernel, whereas standard implementations materialize the full N×N attention matrix, which becomes prohibitive for long sequences.
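The block-wise tiling mentioned above relies on an online softmax: keys and values are visited one block at a time while a running maximum, a running normalizer, and a running weighted sum are kept per query. The following pure-Python sketch shows that arithmetic only; it is not the CUDA kernel, and `tiled_attention`, `naive_attention`, and `block_size` are names invented for this illustration.

```python
import math


def naive_attention(q, k, v):
    """Reference: builds the full row of scores for every query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out


def tiled_attention(q, k, v, block_size=2):
    """Visits keys/values in blocks, keeping only running statistics."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dim_v = len(v[0])
    out = []
    for qi in q:
        running_max = float("-inf")
        running_sum = 0.0
        acc = [0.0] * dim_v
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale everything accumulated under the previous maximum.
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [
                a * correction + sum(w * vj[d] for w, vj in zip(weights, v_blk))
                for d, a in enumerate(acc)
            ]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

Only one block of scores exists at any moment, which is why the full N×N matrix never has to be stored; the real kernels additionally tile the queries and keep the blocks in on-chip memory.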

4

torchtune · Repository · 58/100

via “attention mechanism variants with grouped query attention (gqa) and flash attention support”

PyTorch-native LLM fine-tuning library.

Unique: Integrates flash attention as an optional optimization that is used automatically when available, with fallback to standard PyTorch attention. GQA is implemented as a configurable attention variant that shrinks the KV cache by letting each key/value head be shared by a group of query heads.

vs others: More efficient than standard PyTorch attention because flash attention reduces memory bandwidth, but requires specific hardware and CUDA versions unlike portable attention implementations.
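The key/value sharing behind grouped query attention can be made concrete with a head-index mapping and a cache-size count. This is a minimal sketch with hypothetical helper names (`kv_head_for`, `kv_cache_elements`); it is not torchtune's code.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Return the index of the KV head that a given query head reads from."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def kv_cache_elements(num_kv_heads, seq_len, head_dim, num_layers):
    """Number of cached values: keys plus values, for every layer."""
    return 2 * num_layers * num_kv_heads * seq_len * head_dim


if __name__ == "__main__":
    # 8 query heads sharing 2 KV heads: heads 0-3 use KV head 0, heads 4-7 use KV head 1.
    print([kv_head_for(h, 8, 2) for h in range(8)])
    full = kv_cache_elements(num_kv_heads=8, seq_len=1024, head_dim=64, num_layers=2)
    grouped = kv_cache_elements(num_kv_heads=2, seq_len=1024, head_dim=64, num_layers=2)
    print(full // grouped)
```

With `num_kv_heads == num_query_heads` the mapping reduces to ordinary multi-head attention, and with `num_kv_heads == 1` it becomes multi-query attention; the cache size scales with the number of KV heads in both cases.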

5

AutoGPTQ · Repository · 58/100

via “fused attention module optimization for quantized models”

GPTQ-based LLM quantization with fast CUDA inference.

Unique: Integrates fused attention kernels (flash-attention style) into quantized model implementations, combining query-key-dot-product, softmax, and value-multiplication into a single GPU kernel. Fused attention is automatically selected during inference for supported architectures, reducing memory bandwidth and latency without API changes.

vs others: Faster than standard attention on quantized models because it avoids materializing intermediate attention matrices, and more memory-efficient than unfused attention for long-context inference. Automatic kernel selection eliminates manual optimization code.

6

FLUX.1-dev · Model · 51/100

via “inference optimization with quantization and memory-efficient attention”

Text-to-image model. 733,924 downloads.

Unique: Implements post-training quantization without retraining, enabling efficient deployment on consumer hardware; integrates Flash Attention 2 kernel fusion for 20-30% latency reduction with minimal quality loss

vs others: More practical than distillation-based approaches because no retraining is required; more efficient than naive quantization because it uses learned quantization scales; faster than standard attention because Flash Attention uses fused kernels

7

ModernBERT-base · Model · 49/100

via “efficient transformer inference with flash attention optimization”

Fill-mask model. 1,380,835 downloads.

Unique: Integrates Flash Attention v2 at the transformer block level alongside rotary positional embeddings (RoPE), enabling substitution into standard BERT-style fine-tuning pipelines without code changes

vs others: Achieves 2-3x faster inference and 40-50% lower peak memory than standard PyTorch attention while maintaining exact BERT API compatibility, unlike custom attention implementations that require adapter code

8

make-a-video-pytorch · Framework · 46/100

via “spatiotemporal attention with cross-frame relationships”

Implementation of Make-A-Video, the new SOTA text-to-video generator from Meta AI, in PyTorch

Unique: Combines spatial and temporal attention in a unified module rather than applying them sequentially, enabling direct modeling of spatiotemporal relationships; integrates Flash Attention for kernel-fused computation reducing memory bandwidth bottlenecks

vs others: More memory-efficient than standard multi-head attention (40-50% reduction with Flash Attention) while capturing richer temporal dependencies than frame-independent spatial attention, enabling longer coherent video generation

9

vllm · Platform · 42/100

via “attention backend selection with flashattention and flashinfer optimization”

A high-throughput and memory-efficient inference and serving engine for LLMs

Unique: Implements automatic attention backend selection through runtime benchmarking that tests available backends (FlashAttention, FlashInfer, standard) and selects the fastest option. Supports platform-specific optimizations (ROCm attention kernels, TPU attention) with graceful fallback to standard attention.

vs others: Achieves 2-4x faster attention computation vs. standard PyTorch attention through FlashAttention/FlashInfer; automatic selection eliminates manual tuning and adapts to hardware changes without code modification.

10

ruvector · Repository · 39/100

via “flashattention-3 optimized attention computation”

Self-learning vector database for Node.js — hybrid search, Graph RAG, FlashAttention-3, HNSW, 50+ attention mechanisms

Unique: Brings FlashAttention-3 (typically found in LLM inference frameworks) into the vector DB layer for embedding refinement, whereas competitors treat embeddings as static inputs

vs others: More memory-efficient than naive attention implementations; comparable to Hugging Face Transformers' FlashAttention but integrated into vector search pipeline

11

unsloth · Web App · 39/100

via “custom-triton-kernel-accelerated-attention-dispatch”

Web UI for training and running open models like Gemma 4, Qwen3.6, DeepSeek, gpt-oss locally.

Unique: Implements a unified attention dispatch system that automatically selects between FlashAttention, PagedAttention, and standard implementations at runtime based on sequence length and hardware. Custom Triton kernels for LoRA and quantization-aware attention are monkey-patched into the transformers library's model loading pipeline.

vs others: Faster than vLLM for training (which optimizes inference) and more memory-efficient than standard transformers because it patches attention at the kernel level rather than relying on PyTorch's default CUDA implementations

12

Open-Sora-v2 · Model · 38/100

via “inference optimization through attention mechanism acceleration”

Text-to-video model. 16,568 downloads.

Unique: Provides runtime-configurable attention optimization flags that can be toggled without retraining, allowing users to trade off speed vs. quality based on their hardware and latency constraints. Integrates both Flash Attention (NVIDIA-native, fastest) and xFormers (cross-platform, more flexible) backends with automatic fallback.

vs others: More flexible than models with baked-in attention optimizations because users can enable/disable optimizations at runtime, and faster than naive implementations by 2-4x due to fused kernels and reduced memory bandwidth.

13

torch · Framework · 32/100

via “attention mechanism optimization and transformer-specific kernels”

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Unique: Provides hardware-specific fused attention kernels (flash attention variants) with automatic selection based on input shapes and device, integrated with model compilation for end-to-end optimization. Reduces memory bandwidth and kernel launch overhead.

vs others: More efficient than unfused attention because kernel fusion reduces memory bandwidth by 50-70%, while more portable than hand-written flash attention because automatic selection handles different hardware and input shapes.

14

lora · Model · 32/100

via “xformers memory-efficient attention integration”

Using Low-rank adaptation to quickly fine-tune diffusion models.

Unique: Provides automatic replacement of standard PyTorch attention with xFormers memory-efficient attention, reducing the memory needed for attention from O(n²) to O(n) in sequence length without code changes. Integrates via a monkeypatch at model initialization, enabling transparent optimization.

vs others: Achieves 20-40% faster training and 30-50% lower peak memory than standard PyTorch attention; enables training on 6GB GPUs that would otherwise require 12GB+ with standard attention.
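The monkeypatch integration described above amounts to swapping a method on an existing class at startup. The sketch below shows that pattern with stand-in classes; `AttentionLayer`, `efficient_forward`, and `patch_attention` are hypothetical names, and the placeholder bodies do not call xFormers.

```python
import functools


class AttentionLayer:
    """Stand-in for a library-owned attention module (hypothetical)."""

    def forward(self, scores):
        # Placeholder for the baseline implementation.
        return ("standard", sum(scores))


def efficient_forward(self, scores):
    """Placeholder for an optimized kernel with the same signature."""
    return ("memory_efficient", sum(scores))


def patch_attention(cls, replacement):
    """Swap cls.forward for replacement; return a function that undoes the swap."""
    original = cls.forward

    @functools.wraps(original)
    def patched(self, *args, **kwargs):
        return replacement(self, *args, **kwargs)

    cls.forward = patched

    def restore():
        cls.forward = original

    return restore


if __name__ == "__main__":
    layer = AttentionLayer()           # created before the patch is applied
    restore = patch_attention(AttentionLayer, efficient_forward)
    print(layer.forward([1.0, 2.0]))   # existing instances pick up the patch
    restore()
    print(layer.forward([1.0, 2.0]))
```

Because the method is replaced on the class, callers need no changes, which is what makes the optimization transparent; the trade-off is that the patch silently depends on the patched signature staying stable across library versions.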

15

Unsloth · Framework · 30/100

via “flash attention 2 integration for efficient attention computation”

A Python library for fine-tuning LLMs [#opensource](https://github.com/unslothai/unsloth).

Unique: Automatic architecture detection and seamless replacement of standard attention with Flash Attention 2 kernels without requiring model code changes, with fallback to standard attention on unsupported hardware

vs others: Simpler integration than manual Flash Attention 2 patching, with automatic architecture detection that works across Llama, Mistral, Qwen, and other standard models, achieving 2-4x attention speedup vs 1.5-2x for naive kernel fusion

16

Qwen: Qwen3.5-35B-A3B · Model · 24/100

via “linear attention mechanism for long-context processing”

The Qwen3.5 Series 35B-A3B is a native vision-language model designed with a hybrid architecture that integrates linear attention mechanisms and a sparse mixture-of-experts model, achieving higher inference efficiency. Its overall...

Unique: Uses linear attention kernels to achieve O(n) complexity instead of O(n²), enabling the model to process longer video sequences and higher-resolution images than standard attention-based vision-language models while maintaining reasonable memory footprint during inference.

vs others: Scales to longer contexts and higher resolutions than dense attention models like standard Qwen-VL or LLaVA, with significantly lower memory overhead during inference, though potentially with slight quality trade-offs in attention pattern expressivity.
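The O(n) claim for linear attention follows from reordering the computation so that keys and values are summarized once into a fixed-size state. The sketch below uses a generic kernel feature map (elu(x) + 1) as an assumption for illustration; it is not the specific mechanism used in Qwen3.5, and the function names are hypothetical.

```python
import math


def feature_map(x):
    """elu(x) + 1, which keeps every feature positive."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def linear_attention(q, k, v):
    """Non-causal linear attention: one pass over keys/values, one over queries."""
    dim_k, dim_v = len(k[0]), len(v[0])
    # The state has shape dim_k x dim_v, independent of sequence length.
    kv_state = [[0.0] * dim_v for _ in range(dim_k)]
    k_sum = [0.0] * dim_k
    for kj, vj in zip(k, v):
        fk = feature_map(kj)
        for a in range(dim_k):
            k_sum[a] += fk[a]
            for d in range(dim_v):
                kv_state[a][d] += fk[a] * vj[d]
    out = []
    for qi in q:
        fq = feature_map(qi)
        norm = sum(fq[a] * k_sum[a] for a in range(dim_k))
        out.append([
            sum(fq[a] * kv_state[a][d] for a in range(dim_k)) / norm
            for d in range(dim_v)
        ])
    return out


def quadratic_reference(q, k, v):
    """Same result computed the O(n^2) way, with one weight per query/key pair."""
    out = []
    for qi in q:
        fq = feature_map(qi)
        weights = [sum(a * b for a, b in zip(fq, feature_map(kj))) for kj in k]
        total = sum(weights)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out
```

The two functions agree up to floating-point rounding because the sum over keys can be taken before multiplying by the query. Softmax attention cannot be reordered this way, which is the source of the expressivity trade-off noted above.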
