Object Detection With Transformer Architecture

1

MMDetection (Repository) 56/100

via “transformer-based detection with deformable attention and query optimization”

OpenMMLab detection toolbox with 300+ models.

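Unique: Implements DINO (DETR with Improved deNoising anchor boxes), which adds contrastive denoising training on positive and negative noised queries plus a mixed query selection strategy for initializing decoder queries, reaching state-of-the-art COCO accuracy at the time of its publication without hand-crafted anchors or NMS; deformable attention reduces attention cost from O(n²) to roughly O(n) in the number of feature-map positions, because each query attends only to a small set of learned sampling offsets rather than to every position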

vs others: Simpler pipeline than anchor-based detectors because it eliminates hand-crafted anchors and NMS; more efficient than vanilla DETR because deformable attention samples only a few relevant locations per query; better convergence than early DETR variants due to contrastive denoising training and improved query initialization
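
The O(n²) versus O(n) claim for deformable attention can be made concrete with a little arithmetic. The sketch below only counts how many locations each scheme touches; the feature-map size and the number of sampling points are hypothetical values chosen for illustration, not settings taken from MMDetection or DINO.

```python
def dense_attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention (every token sees every token)."""
    return num_tokens * num_tokens


def deformable_attention_pairs(num_tokens: int, num_points: int) -> int:
    """Locations visited when each query samples a fixed number of points."""
    return num_tokens * num_points


# Hypothetical single-scale feature map of 100 x 150 positions,
# with 4 sampling points per query (illustrative values only).
tokens = 100 * 150
dense = dense_attention_pairs(tokens)
sparse = deformable_attention_pairs(tokens, num_points=4)
print(f"dense: {dense:,}  deformable: {sparse:,}  ratio: {dense // sparse:,}x")
```

The dense count grows with the square of the feature-map size, while the deformable count grows linearly, which is why the gap widens at higher resolutions.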

2

table-transformer-detection (Model) 53/100

via “table-region detection in document images”

object-detection model. 3,394,499 downloads.

Unique: Uses a DETR (Detection Transformer) architecture trained specifically for table detection in documents (on the PubTables-1M dataset), combining CNN visual feature extraction with transformer attention to capture both local table structure and global document context. Unlike region-proposal detectors such as Faster R-CNN, the transformer decoder predicts table locations directly, with no intermediate anchor generation, which can reduce false positives on document backgrounds.

vs others: Positioned as more precise than Faster R-CNN and SSD-based table detectors on mixed-content and real-world scanned documents, because transformer attention can distinguish table boundaries from surrounding text and whitespace more effectively; no benchmark figures are given for this comparison.

3

vit-base-patch16-224 (Model) 52/100

via “patch-based image classification with vision transformer architecture”

image-classification model. 4,771,224 downloads.

Unique: Uses a transformer encoder with no convolutional feature-extraction stages: the image is split into 16x16 patches that are linearly embedded and combined with learned position embeddings, giving a global receptive field from the first layer; pre-trained on ImageNet-21k (14M images) and then fine-tuned on ImageNet-1k (1.3M images) at 224x224 resolution

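vs others: Listed ImageNet top-1 accuracy is 84.0%, against 76.1% for ResNet-50 and 77.1% for EfficientNet-B0, though ViT-Base is a much larger model (about 86M parameters), so accuracy and speed comparisons depend heavily on resolution, hardware, and training recipe; transfers well to downstream tasks, which is usually credited to large-scale pre-training combined with the transformer's global attention mechanism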
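
The numbers in the model name (patch16, 224) determine the token sequence the transformer sees. The sketch below is a minimal, pure-Python illustration of patch tokenization on a single-channel image stored as a list of rows; the function names are hypothetical and this is not the model's actual preprocessing code.

```python
def patch_grid(image_size: int, patch_size: int) -> int:
    """Number of patches along one side of a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    return image_size // patch_size


def patchify(image, patch_size):
    """Split an H x W image (list of rows) into flattened patches, row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ])
    return patches


side = patch_grid(224, 16)           # patches per side for a 224 x 224 input
num_patch_tokens = side * side       # one token per patch
sequence_length = num_patch_tokens + 1  # plus the classification ([CLS]) token
print(side, num_patch_tokens, sequence_length)
```

In the real model each flattened patch (with three color channels) is passed through a learned linear projection before position embeddings are added.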

4

table-transformer-structure-recognition-v1.1-all (Model) 51/100

via “table-structure-detection-via-object-detection”

object-detection model. 1,619,098 downloads.

Unique: Uses DETR (Detection Transformer) architecture with a ResNet-18 backbone, trained on PubTables-1M and FinTabNet, enabling end-to-end learnable detection of table structure without hand-crafted features or region proposal networks. The transformer decoder directly predicts structured table elements (cells, rows, columns, headers) as discrete objects rather than treating table structure recognition as a segmentation or heuristic-based problem.

vs others: Reported to outperform rule-based and Faster R-CNN approaches on complex table layouts because transformer attention captures long-range spatial relationships between table elements, with higher detection accuracy on the PubTables-1M benchmark than the Faster R-CNN baseline.
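
Treating rows and columns as detected objects means a cell grid can be recovered afterwards by intersecting the two sets of boxes. The sketch below shows that idea under simplifying assumptions (axis-aligned boxes, no spanning cells, no score filtering); the function name and box layout are hypothetical and this is not the model's own post-processing code.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def cells_from_rows_and_columns(rows: List[Box], columns: List[Box]) -> List[List[Box]]:
    """Build a grid of cell boxes by intersecting each row with each column.

    Rows are ordered top to bottom and columns left to right, so
    grid[i][j] is the cell in the i-th row and j-th column.
    """
    rows_sorted = sorted(rows, key=lambda box: box[1])
    columns_sorted = sorted(columns, key=lambda box: box[0])
    grid = []
    for row in rows_sorted:
        # A cell takes its horizontal extent from the column
        # and its vertical extent from the row.
        grid.append([(col[0], row[1], col[2], row[3]) for col in columns_sorted])
    return grid


# Detections in arbitrary order, as a detector might return them.
detected_rows = [(0, 20, 100, 40), (0, 0, 100, 20)]
detected_columns = [(50, 0, 100, 40), (0, 0, 50, 40)]
for grid_row in cells_from_rows_and_columns(detected_rows, detected_columns):
    print(grid_row)
```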

5

table-transformer-structure-recognition (Model) 51/100

via “table-structure-detection-via-object-detection”

object-detection model. 1,326,815 downloads.

Unique: Uses DETR (Detection Transformer) architecture with a CNN backbone and transformer encoder-decoder, enabling end-to-end table structure detection without hand-crafted features or region proposal networks. Trained specifically on table structure annotations rather than generic object detection datasets, making it structurally aware of table-specific patterns like cell alignment and hierarchical row/column relationships.

vs others: More accurate than rule-based or heuristic table detection (line-following, grid detection) because it learns semantic table structure; avoids the region-proposal and NMS stages of Faster R-CNN variants, although actual inference speed depends on backbone and hardware; more specialized than generic object detectors (YOLO, Faster R-CNN), which lack table-specific training

6

RMBG-2.0 (Model) 47/100

via “semantic-aware background segmentation with transformer architecture”

image-segmentation model. 544,032 downloads.

Unique: Built on the BiRefNet (Bilateral Reference Network) architecture, a transformer-backed encoder-decoder for high-resolution foreground segmentation, instead of a traditional U-Net CNN, enabling better generalization across diverse image types and improved handling of complex boundaries through attention mechanisms that model long-range dependencies

vs others: Outperforms traditional background removal tools (like rembg v1 or OpenCV GrabCut) on complex subjects with fine details because transformer attention captures semantic context globally rather than relying on local color/edge cues

7

yolos-small (Model) 46/100

via “vision transformer-based object detection with patch tokenization”

object-detection model. 735,352 downloads.

Unique: Uses a plain Vision Transformer with patch-based tokenization (no CNN backbone) for object detection: a fixed set of learnable detection ([DET]) tokens is appended to the patch sequence and trained with DETR's bipartite matching loss, so detection is treated as a sequence-to-sequence task rather than a region-proposal-based one. It keeps standard global self-attention with no patch merging or other hierarchical design, so compute grows quickly with input resolution.

vs others: Intended mainly to show how well a near-vanilla ViT transfers to detection; trades accuracy for architectural simplicity compared to Faster R-CNN; lighter than Mask R-CNN for constrained deployment, while its token-based design stays composable with other transformer models, including language models

8

detr-resnet-50 (Model) 45/100

via “end-to-end transformer-based object detection with resnet-50 backbone”

object-detection model. 239,063 downloads.

Unique: DETR (Detection Transformer) eliminates hand-designed detection components (anchors, NMS) by formulating detection as a set prediction problem with bipartite matching, using a pure transformer encoder-decoder on top of ResNet-50 features rather than region proposal networks or anchor grids

vs others: Simpler architecture than Faster R-CNN (no RPN, no NMS) and more interpretable than YOLO, but slower inference and weaker small-object detection make it better suited for research and moderate-latency applications than production real-time systems
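
Set prediction with bipartite matching means each ground-truth object is paired with exactly one prediction so that the total matching cost is minimal. The sketch below illustrates the idea with a brute-force search over assignments and a plain L1 box cost. Both are simplifying assumptions: DETR's actual cost also includes class probability and a generalized IoU term, and practical implementations use the Hungarian algorithm rather than enumeration.

```python
from itertools import permutations


def l1_cost(pred_box, target_box):
    """L1 distance between two boxes given as equal-length coordinate tuples."""
    return sum(abs(p - t) for p, t in zip(pred_box, target_box))


def match_predictions(preds, targets):
    """Find the one-to-one assignment with minimum total cost.

    Returns (assignment, cost), where assignment[j] is the index of the
    prediction matched to targets[j]. Requires len(preds) >= len(targets).
    Brute force, so only suitable for tiny illustrative inputs.
    """
    best_assignment, best_cost = (), float("inf")
    for candidate in permutations(range(len(preds)), len(targets)):
        cost = sum(l1_cost(preds[i], targets[j]) for j, i in enumerate(candidate))
        if cost < best_cost:
            best_assignment, best_cost = candidate, cost
    return best_assignment, best_cost


# Three predictions, two ground-truth boxes; the unmatched prediction
# would be trained toward the "no object" class.
predictions = [(0.9, 0.9, 1.0, 1.0), (0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.6, 0.6)]
ground_truth = [(0.1, 0.1, 0.2, 0.2), (0.9, 0.9, 1.0, 1.0)]
assignment, total_cost = match_predictions(predictions, ground_truth)
print(assignment, total_cost)
```

Because every target is claimed by exactly one prediction during training, duplicate detections are penalized by the loss itself, which is what removes the need for NMS at inference time.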

9

detr-doc-table-detection (Model) 44/100

via “document table detection via transformer-based object localization”

object-detection model. 204,862 downloads.

Unique: Uses DETR's transformer-based set prediction approach instead of traditional anchor-based detectors (Faster R-CNN, YOLO), eliminating hand-crafted NMS and enabling direct end-to-end optimization for document table detection; fine-tuned specifically on ICDAR2019 document dataset rather than generic object detection datasets like COCO

vs others: Achieves higher precision on document tables than generic YOLO/Faster R-CNN models because it's domain-specialized on document layouts and uses transformer attention to reason about table structure globally rather than locally, though it trades inference speed for accuracy compared to lightweight YOLO variants

10

rtdetr_r18vd_coco_o365 (Model) 43/100

via “real-time object detection with transformer-based architecture”

object-detection model. 521,638 downloads.

Unique: Uses transformer-based detection with anchor-free, NMS-free design (RT-DETR architecture) instead of traditional Faster R-CNN/YOLO CNN pipelines; eliminates hand-crafted anchor definitions and post-processing NMS, enabling end-to-end optimization and faster convergence during training

vs others: Faster inference than DETR variants and comparable to YOLOv8 while maintaining transformer interpretability; outperforms ResNet-50 Faster R-CNN on COCO at similar latency due to efficient attention mechanisms
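
For contrast, it helps to see the post-processing step that NMS-free detectors avoid. The sketch below is a minimal greedy non-maximum suppression over axis-aligned boxes, as a conventional detector pipeline might apply it; the 0.5 IoU threshold is an illustrative default, not a value taken from any model listed here.

```python
def iou(a, b):
    """Intersection over union of two boxes in (x_min, y_min, x_max, y_max) form."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping any box that overlaps a kept one too much.

    Returns the indices of the kept boxes, in descending score order.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))
```

The threshold is a hand-tuned hyperparameter, and the step runs outside the network, so an NMS-free design removes both a tuning knob and a source of latency that is not optimized end to end.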

11

segformer-b5-finetuned-ade-640-640 (Fine-tune) 43/100

via “semantic-scene-segmentation-with-transformer-backbone”

image-segmentation model. 61,096 downloads.

Unique: Uses SegFormer architecture with a hierarchical transformer encoder (the B5 variant, the largest in the family at roughly 82M encoder parameters) and a lightweight all-MLP decoder instead of dense convolutional decoders, enabling efficient multi-scale feature fusion without a heavy decoder. Fine-tuned on ADE20K's 150 semantic classes at 640x640 resolution, with mIoU that was state of the art on scene parsing benchmarks when SegFormer was published.

vs others: Outperforms DeepLabV3+ and PSPNet on ADE20K scene parsing (mIoU ~50%); the several-fold parameter savings often quoted for SegFormer are relative to large ViT-based segmenters such as SETR rather than to these CNN baselines; faster inference than plain ViT-based segmentation approaches due to hierarchical design, but slower than lightweight MobileNet-based segmenters for resource-constrained deployment.
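
The multi-scale fusion described above can be followed through tensor shapes alone. SegFormer's hierarchical encoder emits feature maps at strides 4, 8, 16, and 32; the decoder projects each to a common channel width, upsamples them to the stride-4 resolution, and concatenates them. The sketch below only computes those shapes; the embedding width of 256 is a hypothetical value for illustration, not the B5 configuration.

```python
STRIDES = (4, 8, 16, 32)  # downsampling factor of each encoder stage


def stage_resolutions(height, width, strides=STRIDES):
    """Spatial size (H, W) of each encoder stage's output feature map."""
    return [(height // s, width // s) for s in strides]


def fused_shape(height, width, embed_dim, strides=STRIDES):
    """Shape (C, H, W) after projecting every stage to embed_dim channels,
    upsampling to the first stage's resolution, and concatenating."""
    out_h, out_w = height // strides[0], width // strides[0]
    return (embed_dim * len(strides), out_h, out_w)


# 640 x 640 input, with a hypothetical decoder width of 256 channels.
print(stage_resolutions(640, 640))
print(fused_shape(640, 640, embed_dim=256))
```

The fused tensor is then reduced by a further MLP layer and mapped to per-pixel class scores (150 classes for ADE20K).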

12

segformer-b1-finetuned-ade-512-512 (Fine-tune) 43/100

via “semantic-scene-segmentation-with-transformer-backbone”

image-segmentation model. 177,465 downloads.

Unique: Uses hierarchical vision transformer (SegFormer) with all-MLP decoder instead of convolutional decoders, enabling efficient multi-scale feature fusion without expensive upsampling operations. Fine-tuned on ADE20K's 150 semantic classes (vs COCO's 80 or Cityscapes' 19) providing richer scene understanding for indoor/outdoor environments.

vs others: Faster inference and lower memory than DeepLabv3+ (ResNet backbone) while maintaining competitive mIoU; more efficient than ViT-based segmentation due to hierarchical design; outperforms FCN/U-Net on complex scene parsing due to transformer's global receptive field.

13

yolos-tiny (Model) 41/100

via “vision transformer-based object detection with attention-weighted region proposals”

object-detection model. 83,525 downloads.

Unique: Applies a plain transformer architecture to object detection with no CNN backbone, using learnable detection tokens (analogous to DETR's object queries and trained with the same bipartite matching loss), enabling attention-based spatial reasoning without region proposal networks; the tiny variant has roughly 6.5M parameters, a size that comes from its small DeiT-tiny backbone rather than from compressing a larger model, while retaining COCO detection capability

vs others: Simpler architecture than Faster R-CNN (no RPN) and more parameter-efficient than standard ViT detectors, but slower inference than optimized YOLO v5/v8 on edge devices due to transformer computational overhead

14

detr-resnet-101 (Model) 41/100

via “end-to-end transformer-based object detection with resnet-101 backbone”

object-detection model. 63,737 downloads.

Unique: Uses transformer encoder-decoder with bipartite matching loss instead of anchor-based region proposals or sliding windows, eliminating hand-crafted NMS and enabling direct set prediction of objects as a sequence-to-sequence problem

vs others: Simpler pipeline than Faster R-CNN (no RPN, no NMS) and more interpretable than YOLO, but slower inference due to transformer quadratic complexity compared to single-stage detectors

15

rtdetr_r101vd_coco_o365 (Model) 40/100

via “real-time object detection with transformer-based architecture”

object-detection model. 121,720 downloads.

Unique: Uses transformer encoder-decoder architecture with direct set prediction (eliminating anchor boxes and NMS) combined with a ResNet-101-VD backbone, achieving real-time performance through efficient attention mechanisms and a hybrid CNN-transformer design that balances speed and accuracy; as the name suggests, it is pre-trained on the Objects365 dataset (365 object categories) and fine-tuned on COCO (80 categories)

vs others: Faster than traditional Faster R-CNN/Mask R-CNN detectors (listed latencies of roughly 50-100 ms vs 200-400 ms, hardware unspecified) while maintaining higher accuracy than lightweight YOLO variants through transformer attention, and more practical for production than ViT-based detectors due to its CNN backbone

16

rtdetr_r50vd_coco_o365 (Model) 39/100

via “real-time object detection with transformer-based architecture”

object-detection model. 80,830 downloads.

Unique: Uses transformer encoder-decoder architecture with deformable attention mechanisms instead of traditional CNN-based region proposal networks; eliminates anchor boxes and NMS post-processing, reducing inference pipeline complexity while maintaining real-time performance through efficient attention computation

vs others: Faster inference than Faster R-CNN (no RPN overhead) and simpler than YOLO (no anchor engineering), while maintaining transformer-based reasoning for improved generalization across diverse object scales and aspect ratios

17

rtdetr_v2_r18vd (Model) 39/100

via “real-time object detection with deformable transformer attention”

object-detection model. 106,918 downloads.

Unique: Uses deformable transformer attention (sampling only task-relevant spatial locations) combined with a ResNet-18 backbone for real-time inference, whereas standard DETR processes full feature maps with quadratic attention complexity. This architectural choice is listed as reducing FLOPs by roughly 40% compared to vanilla transformer detectors, while keeping the anchor-free detection paradigm.

vs others: Positioned as faster than YOLOv8 on edge devices, a claim that depends on hardware and on whether NMS time is counted, and more accurate than lightweight anchor-based detectors (MobileNet-SSD) because transformer attention captures long-range spatial relationships without hand-crafted anchor priors.

18

rtdetr_r50vd (Model) 36/100

via “real-time object detection with deformable transformer architecture”

object-detection model. 32,868 downloads.

Unique: Uses deformable cross-attention in the decoder instead of standard multi-head attention, allowing the model to dynamically sample only task-relevant spatial locations; combined with a ResNet-50-VD backbone (a ResNet-50 variant with modified stem and downsampling layers), this achieves <100ms inference while maintaining COCO AP of 53.0+ without NMS post-processing

vs others: Faster end-to-end inference than comparable YOLOv8 models on equivalent hardware once NMS post-processing time is included, and more accurate than EfficientDet-D0 on COCO while using fewer parameters than Faster R-CNN variants

19

detr-resnet-50-dc5 (Model) 35/100

object-detection model. 38,839 downloads.

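Unique: Uses DETR's end-to-end transformer design, which needs no anchor boxes or NMS, with the DC5 (dilated C5) backbone variant: the last ResNet-50 stage is dilated so the feature map has twice the resolution, which improves small-object detection at the cost of more computation in the encoder's self-attention.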

vs others: Simpler pipeline than traditional object detection models like Faster R-CNN, which require anchor box configuration and NMS; the trade-off is that DETR needs a much longer training schedule to converge.

20

deformable-detr (Model) 34/100

via “deformable object detection”

object-detection model. 27,497 downloads.

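Unique: Incorporates deformable attention, in which each query attends only to a small set of learned sampling points around a reference point, across multiple feature scales, instead of attending to every position as in DETR's dense attention; this speeds up convergence and improves detection of small objects.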

vs others: More adaptable to varying object shapes and sizes than traditional object detection models like Faster R-CNN due to its deformable attention mechanism.
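
The core of deformable attention is that a query reads the feature map at a few fractional locations near a reference point and combines the samples with attention weights. The sketch below shows that sampling step on a single-channel, single-scale feature map. It is a simplified illustration: in the actual model the offsets and weights are predicted from the query by learned linear layers, whereas here they are passed in directly, and all function names are hypothetical.

```python
import math


def bilinear_sample(feature, x, y):
    """Sample a 2-D feature map (list of rows) at a continuous (x, y) location.

    Neighbours that fall outside the map contribute zero.
    """
    height, width = len(feature), len(feature[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= xi < width and 0 <= yi < height:
                weight = (1 - abs(x - xi)) * (1 - abs(y - yi))
                value += weight * feature[yi][xi]
    return value


def deformable_attend(feature, reference, offsets, weights):
    """Weighted sum of features sampled at reference + offset.

    `weights` are assumed to be already normalized (summing to 1).
    """
    ref_x, ref_y = reference
    return sum(
        w * bilinear_sample(feature, ref_x + dx, ref_y + dy)
        for (dx, dy), w in zip(offsets, weights)
    )


feature_map = [[0.0, 1.0], [2.0, 3.0]]
# Two sampling points: the reference itself and one shifted by (+1, +1).
output = deformable_attend(feature_map, reference=(0.0, 0.0),
                           offsets=[(0.0, 0.0), (1.0, 1.0)],
                           weights=[0.5, 0.5])
print(output)
```

Only the sampled locations are read, so the cost per query is fixed by the number of sampling points rather than by the size of the feature map.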
