segformer-b5-finetuned-ade-640-640 (Model, 40/100) via “multi-scale-contextual-feature-extraction”
image-segmentation model. 77,998 downloads.
Unique: Extracts hierarchical features through overlapping patch embeddings at four stages (4x, 8x, 16x, and 32x downsampling) and applies efficient self-attention at each stage, shortening the key/value sequence so that attention over high-resolution features does not become a quadratic bottleneck. A lightweight all-MLP decoder then projects the four stage outputs to a common channel width, upsamples them to a shared resolution, and fuses them, combining context across scales without a heavy context module.
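The stage arithmetic can be checked with a minimal sketch. The kernel, stride, padding, and spatial-reduction values below are assumptions taken from the commonly published SegFormer (MiT) encoder configuration, not from this listing, and the helper names are hypothetical. The sketch only counts feature-map sizes and query-key pairs; it does not run the model.

```python
# Assumed MiT encoder hyperparameters (not stated in the listing).
PATCH_KERNELS = (7, 3, 3, 3)   # overlapping patches: kernel > stride
PATCH_STRIDES = (4, 2, 2, 2)   # cumulative downsampling: 4x, 8x, 16x, 32x
PATCH_PADDING = (3, 1, 1, 1)
SR_RATIOS = (8, 4, 2, 1)       # per-stage spatial reduction of keys/values


def conv_out(size, kernel, stride, padding):
    """Output length of a strided convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def stage_sizes(side):
    """Feature-map side length after each of the four patch-embedding stages."""
    sizes = []
    for k, s, p in zip(PATCH_KERNELS, PATCH_STRIDES, PATCH_PADDING):
        side = conv_out(side, k, s, p)
        sizes.append(side)
    return sizes


def attention_pairs(side, sr_ratio):
    """Query-key pairs per head: (with reduced keys, with dense attention)."""
    queries = side * side
    reduced_keys = (side // sr_ratio) ** 2
    return queries * reduced_keys, queries * queries


if __name__ == "__main__":
    for stage, (side, r) in enumerate(zip(stage_sizes(640), SR_RATIOS), start=1):
        reduced, dense = attention_pairs(side, r)
        print(f"stage {stage}: {side}x{side} map, "
              f"{reduced:,} pairs vs {dense:,} dense")
```

Under these assumptions a 640x640 input yields 160, 80, 40, and 20 pixel feature maps, and the first stage scores 64 times fewer query-key pairs than dense attention would.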
vs others: Cheaper to run than plain ViT backbones, which apply global attention over a single-scale patch grid, and more content-adaptive than CNN feature pyramids (ResNet, EfficientNet), whose convolution kernels are fixed after training while attention weights are computed per image. Multi-scale context is fused by learned layers across the encoder stages rather than by the fixed dilation rates of the DeepLabV3+ ASPP module.