Real Time Object Detection With Transformer Based Architecture

1

MediaPipe | Framework | 60/100

via “object detection with bounding box localization”

Google's cross-platform on-device ML framework with pre-built solutions.

Unique: Provides unified object detection API across Android, iOS, Web, and Python with built-in support for multiple pre-trained models (COCO, Open Images) and custom model fine-tuning via Model Maker; uses hardware acceleration (GPU/NPU) on mobile platforms for real-time inference.

vs others: More mobile-optimized and faster on edge devices than the TensorFlow Object Detection API, and includes built-in model customization via Model Maker unlike many pre-trained-only alternatives, but offers fewer detection-specific features than dedicated toolkits built around models such as YOLOv8 or Faster R-CNN.

2

MMDetection | Repository | 58/100

via “transformer-based detection with deformable attention and query optimization”

OpenMMLab detection toolbox with 300+ models.

Unique: Implements DINO (DETR with Improved deNoising anchor boxes), which adds contrastive denoising training over positive and negative queries plus a mixed query selection strategy, reaching state-of-the-art accuracy at the time of its release without hand-crafted components; deformable attention reduces attention cost from O(n²) to O(n) in the number of feature-map locations by learning spatial offsets to a small set of relevant sampling points.

vs others: More elegant than anchor-based detectors because it eliminates hand-crafted anchors and NMS; more efficient than vanilla DETR because deformable attention focuses on relevant regions; better convergence than early DETR variants due to contrastive learning and query optimization
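The O(n²) versus O(n) claim can be made concrete with a rough count of how many query-key pairs each scheme scores. This is only a sketch: the feature-map size and the default of 4 sampling points per query are illustrative assumptions, not values taken from MMDetection's configuration, and the function names are hypothetical.

```python
def dense_attention_pairs(num_tokens: int) -> int:
    """Pairs scored by full self-attention: every token attends to every token."""
    return num_tokens * num_tokens


def deformable_attention_pairs(num_tokens: int, num_points: int = 4) -> int:
    """Pairs scored when each query attends to a fixed number of sampled points.

    num_points=4 is an assumed value chosen for illustration.
    """
    return num_tokens * num_points


# Hypothetical feature map with 100 x 150 locations.
n = 100 * 150
print(dense_attention_pairs(n))       # 225000000
print(deformable_attention_pairs(n))  # 60000
```

Doubling the number of locations quadruples the dense count but only doubles the deformable one, which is the practical meaning of the complexity reduction.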

3

table-transformer-detection | Model | 53/100

via “table-region detection in document images”

object-detection model. 3,394,499 downloads.

Unique: Uses a DETR (Detection Transformer) architecture specifically fine-tuned for table detection in documents, combining CNN visual feature extraction with transformer attention mechanisms to capture both local table structure and global document context. Unlike traditional region-proposal networks (Faster R-CNN), the transformer decoder directly predicts table locations without intermediate anchor generation, reducing false positives on document backgrounds.

vs others: Outperforms Faster R-CNN and SSD-based table detectors on mixed-content documents because transformer attention can distinguish table boundaries from surrounding text and whitespace more effectively, achieving higher precision on real-world scanned documents.

4

vit-base-patch16-224 | Model | 52/100

via “patch-based image classification with vision transformer architecture”

image-classification model. 4,771,224 downloads.

Unique: Uses pure transformer architecture (no convolutional backbone) with learnable patch embeddings and positional encodings, giving a global receptive field from the first layer and strong transfer learning compared to CNN-based models; pre-trained on ImageNet-21k (14M images) and then fine-tuned on ImageNet-1k (1.3M images) at 224x224 resolution.

vs others: Reports higher ImageNet top-1 accuracy than ResNet-50 and EfficientNet-B0 (84.0% vs 76.1% and 77.1%), but at 86M parameters it is considerably heavier than either (roughly 25.6M and 5.3M parameters), so inference is generally slower rather than comparable; transfers well to downstream tasks thanks to the transformer's global attention mechanism.
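The "patch16-224" part of the model name determines the sequence the transformer sees. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the single extra token is assumed to be the usual classification token.

```python
def vit_sequence_length(image_size: int = 224, patch_size: int = 16, extra_tokens: int = 1) -> int:
    """Number of tokens fed to the encoder: one per patch plus any extra tokens."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + extra_tokens


def patch_vector_length(patch_size: int = 16, channels: int = 3) -> int:
    """Length of one flattened patch before it is linearly projected to an embedding."""
    return patch_size * patch_size * channels


print(vit_sequence_length())   # 197  (14 x 14 patches + 1 extra token)
print(patch_vector_length())   # 768
```

Because self-attention cost grows with the square of the sequence length, halving the patch size to 8 would roughly quadruple the token count, which is one reason larger inputs are expensive for plain ViT models.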

5

mobilevit-small | Model | 48/100

via “lightweight mobile vision transformer image classification”

image-classification model. 2,781,568 downloads.

Unique: Uses a hybrid local-to-global architecture combining depthwise separable convolutions for local feature extraction with multi-head self-attention for global context, achieving 78.4% ImageNet-1k top-1 accuracy with 5.6M parameters, far smaller than ViT-Base (86M params) while keeping transformer expressiveness for mobile deployment.

vs others: Outperforms MobileNetV3 (75.2% accuracy) at comparable model size and offers strong transfer learning thanks to its transformer components; close in size to EfficientNet-B0 (77.1%, 5.3M params) with higher accuracy, though its attention layers can run slower than pure-CNN models on mobile hardware.
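Depthwise separable convolutions are a large part of why models in this class stay small. The sketch below compares weight counts for a standard convolution and its depthwise separable equivalent; the layer sizes are illustrative assumptions rather than actual MobileViT layer dimensions, and biases are ignored.

```python
def standard_conv_params(kernel: int, in_channels: int, out_channels: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return kernel * kernel * in_channels * out_channels


def depthwise_separable_params(kernel: int, in_channels: int, out_channels: int) -> int:
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv (bias ignored)."""
    depthwise = kernel * kernel * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 64 input channels, 128 output channels.
print(standard_conv_params(3, 64, 128))        # 73728
print(depthwise_separable_params(3, 64, 128))  # 8768
```

For this example the separable form needs roughly an eighth of the weights, and the gap widens as the channel counts grow.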

6

RMBG-2.0 | Model | 47/100

via “semantic-aware background segmentation with transformer architecture”

image-segmentation model. 544,032 downloads.

Unique: Implements a modern transformer-based segmentation architecture (built on BiRefNet, according to the model card) instead of traditional U-Net CNNs, enabling better generalization across diverse image types and improved handling of complex boundaries through attention mechanisms that model long-range dependencies.

vs others: Outperforms traditional background removal tools (like rembg v1 or OpenCV GrabCut) on complex subjects with fine details because transformer attention captures semantic context globally rather than relying on local color/edge cues

7

yolos-small | Model | 46/100

via “vision transformer-based object detection with patch tokenization”

object-detection model. 735,352 downloads.

Unique: Uses a plain Vision Transformer with patch-based tokenization (no CNN backbone) for object detection, treating detection as a sequence-to-sequence task rather than a region-proposal-based approach. Appends learnable detection tokens to the patch sequence and trains with a DETR-style bipartite matching loss; the design deliberately keeps changes to ViT minimal and does not add patch merging or other multi-scale machinery.

vs others: Simpler than detectors that attach a separate detection head to a ViT backbone, but trades accuracy for simplicity compared to Faster R-CNN, and full self-attention makes high-resolution inputs expensive; lighter than Mask R-CNN for edge deployment while maintaining transformer composability with language models.

8

detr-resnet-50 | Model | 45/100

via “end-to-end transformer-based object detection with resnet-50 backbone”

object-detection model. 239,063 downloads.

Unique: DETR (Detection Transformer) eliminates hand-designed detection components (anchors, NMS) by formulating detection as a set prediction problem with bipartite matching, using a pure transformer encoder-decoder on top of ResNet-50 features rather than region proposal networks or anchor grids

vs others: Simpler architecture than Faster R-CNN (no RPN, no NMS) and more interpretable than YOLO, but slower inference and weaker small-object detection make it better suited for research and moderate-latency applications than production real-time systems
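The set-prediction idea behind bipartite matching can be illustrated with a toy matcher. This is a simplified sketch, not DETR's implementation: the cost here is only an L1 distance between boxes (DETR's matching cost also includes class probability and a generalized IoU term), and the search is brute force over permutations instead of the Hungarian algorithm, so it is practical only for a handful of boxes. All names and box values are hypothetical.

```python
from itertools import permutations


def l1_box_cost(pred, target):
    """Sum of absolute differences between two (cx, cy, w, h) boxes."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def match_predictions(preds, targets):
    """Find the one-to-one assignment of predictions to targets with the lowest total cost.

    Returns (assignment, cost), where assignment[i] is the index of the
    prediction matched to target i. Returns (None, inf) if there are more
    targets than predictions.
    """
    best_assignment, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(l1_box_cost(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_assignment, best_cost = perm, cost
    return best_assignment, best_cost


# Three predicted boxes, two ground-truth boxes (illustrative values).
preds = [(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1), (0.8, 0.8, 0.3, 0.3)]
targets = [(0.8, 0.8, 0.3, 0.3), (0.5, 0.5, 0.2, 0.2)]

assignment, cost = match_predictions(preds, targets)
print(assignment)  # (2, 0)
```

Prediction 1 is left unmatched, which in a set-prediction detector corresponds to the "no object" class. Because each ground-truth box is matched to exactly one prediction during training, duplicate detections are penalized directly and no NMS step is needed afterward.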

9

Deepseek v4 people | Model | 45/100

via “people detection and recognition”

Deepseek v4 people

Unique: Utilizes a hybrid architecture combining CNNs and transformers for enhanced accuracy in diverse conditions, unlike traditional models that rely solely on CNNs.

vs others: Offers superior accuracy in challenging environments compared to standard face recognition models, which often struggle with variations in lighting and angles.

10

segformer-b0-finetuned-ade-512-512 | Fine-tune | 45/100

via “semantic-scene-segmentation-with-transformer-backbone”

image-segmentation model. 508,692 downloads.

Unique: Lightweight B0 variant (3.7M parameters) with hierarchical transformer encoder enables efficient client-side inference via ONNX, avoiding cloud API calls; the weights take roughly 15 MB at 32-bit precision, and 8-bit quantization would reduce that to roughly 4 MB, with ADE20K accuracy reportedly staying within 2-3% of the original.

vs others: Smaller and faster than DeepLabV3+ (59M params) for browser deployment, more accurate than FCN-based segmentation on complex indoor scenes due to transformer attention, and open-source unlike proprietary cloud APIs (Google Vision, AWS Rekognition)
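A back-of-the-envelope check of model size from parameter count helps when reading figures like these. The sketch below counts only raw weight storage and ignores file-format overhead; the function name is hypothetical, and 3.7M is the parameter count quoted in this entry.

```python
def weight_size_mb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes), ignoring file overhead."""
    return num_params * bits_per_weight / 8 / 1e6


print(weight_size_mb(3.7e6, 32))  # 14.8
print(weight_size_mb(3.7e6, 8))   # 3.7
```

By this estimate a file of about 15 MB corresponds to 32-bit weights, while an 8-bit version of the same model would be closer to 4 MB.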

11

detr-doc-table-detection | Model | 44/100

via “document table detection via transformer-based object localization”

object-detection model. 204,862 downloads.

Unique: Uses DETR's transformer-based set prediction approach instead of traditional anchor-based detectors (Faster R-CNN, YOLO), eliminating hand-crafted NMS and enabling direct end-to-end optimization for document table detection; fine-tuned specifically on ICDAR2019 document dataset rather than generic object detection datasets like COCO

vs others: Achieves higher precision on document tables than generic YOLO/Faster R-CNN models because it's domain-specialized on document layouts and uses transformer attention to reason about table structure globally rather than locally, though it trades inference speed for accuracy compared to lightweight YOLO variants

12

rtdetr_r18vd_coco_o365 | Model | 43/100

via “real-time object detection with transformer-based architecture”

object-detection model. 521,638 downloads.

Unique: Uses transformer-based detection with anchor-free, NMS-free design (RT-DETR architecture) instead of traditional Faster R-CNN/YOLO CNN pipelines; eliminates hand-crafted anchor definitions and post-processing NMS, enabling end-to-end optimization and faster convergence during training

vs others: Faster inference than DETR variants and comparable to YOLOv8 while maintaining transformer interpretability; outperforms ResNet-50 Faster R-CNN on COCO at similar latency due to efficient attention mechanisms
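For context, the post-processing step that NMS-free designs avoid looks roughly like the following. This is a minimal pure-Python sketch of greedy non-maximum suppression with an assumed IoU threshold of 0.5; production pipelines use vectorized library implementations, and the boxes and scores below are made up for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The threshold is a hand-tuned hyperparameter, and the loop runs after the network on every frame, which is why removing it simplifies deployment and makes latency more predictable.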

13

segformer-b5-finetuned-ade-640-640 | Fine-tune | 43/100

via “semantic-scene-segmentation-with-transformer-backbone”

image-segmentation model. 61,096 downloads.

Unique: Uses SegFormer architecture with hierarchical transformer encoder (B5, the largest variant, with roughly 85M parameters) and lightweight MLP decoder instead of dense convolutional decoders, enabling efficient multi-scale feature fusion without expensive upsampling operations. Fine-tuned on ADE20K's 150 semantic classes at 640x640 resolution, reaching state-of-the-art mIoU on scene parsing benchmarks at the time of its release.

vs others: Outperforms DeepLabV3+ and PSPNet on ADE20K scene parsing (mIoU ~51%), although B5 is not a small model and the large parameter savings apply to the lighter B0-B2 variants; faster inference than ViT-based segmentation approaches due to hierarchical design, but slower than lightweight MobileNet-based segmenters for resource-constrained deployment.
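The mIoU metric quoted for ADE20K averages per-class intersection over union. A minimal sketch on flattened label lists follows; benchmark implementations accumulate intersections and unions over the whole dataset and handle an ignore label, which this toy version does not, and the labels below are made up.

```python
def mean_iou(pred, target, num_classes):
    """Mean of per-class IoU over classes that appear in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Six pixels, three classes (illustrative labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. Because every class counts equally, rare classes affect mIoU as much as common ones, which is why it is a stricter measure than pixel accuracy.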

14

segformer-b1-finetuned-ade-512-512 | Fine-tune | 43/100

via “semantic-scene-segmentation-with-transformer-backbone”

image-segmentation model. 177,465 downloads.

Unique: Uses hierarchical vision transformer (SegFormer) with all-MLP decoder instead of convolutional decoders, enabling efficient multi-scale feature fusion without expensive upsampling operations. Fine-tuned on ADE20K's 150 semantic classes (vs COCO's 80 or Cityscapes' 19) providing richer scene understanding for indoor/outdoor environments.

vs others: Faster inference and lower memory than DeepLabv3+ (ResNet backbone) while maintaining competitive mIoU; more efficient than ViT-based segmentation due to hierarchical design; outperforms FCN/U-Net on complex scene parsing due to transformer's global receptive field.
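The "hierarchical" and "multi-scale" descriptions refer to the encoder producing feature maps at several resolutions. The sketch below assumes the four stage strides of 4, 8, 16, and 32 described in the SegFormer paper; whether a given checkpoint deviates from them is not something this listing states, and the function names are hypothetical.

```python
def stage_resolutions(image_size: int = 512, strides=(4, 8, 16, 32)):
    """Side length of the feature map produced by each encoder stage."""
    return [image_size // s for s in strides]


def stage_token_counts(image_size: int = 512, strides=(4, 8, 16, 32)):
    """Number of spatial positions (tokens) at each encoder stage."""
    return [r * r for r in stage_resolutions(image_size, strides)]


print(stage_resolutions())   # [128, 64, 32, 16]
print(stage_token_counts())  # [16384, 4096, 1024, 256]
```

The early stages keep fine spatial detail and the late stages carry coarse context over far fewer tokens, which is where the hierarchical design saves compute relative to a single-resolution ViT; the decoder then fuses all four maps into one prediction.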

15

yolov10s | Model | 42/100

via “real-time multi-scale object detection with anchor-free architecture”

object-detection model. 223,706 downloads.

Unique: YOLOv10 introduces an anchor-free detection head with NMS-free training, eliminating the need for hand-crafted anchor boxes and post-processing NMS operations. This architectural shift reduces hyperparameter tuning surface and improves inference speed by ~20% vs YOLOv8 while maintaining competitive accuracy on COCO.

vs others: Faster than Faster R-CNN (two-stage) for real-time use cases and simpler to deploy than EfficientDet due to anchor-free design requiring no anchor configuration; trades some precision on tiny objects vs Mask R-CNN for speed-critical applications.

16

segformer-b2-finetuned-ade-512-512 | Fine-tune | 42/100

via “semantic-scene-segmentation-with-transformer-backbone”

image-segmentation model. 63,104 downloads.

Unique: Uses SegFormer's efficient hierarchical transformer encoder with a linear projection (all-MLP) decoder instead of dense convolutional decoders, cutting parameters by more than half vs DeepLabV3+ while maintaining competitive accuracy. Mix-transformer backbone progressively fuses multi-scale features without expensive upsampling operations, enabling faster inference on edge hardware.

vs others: Faster inference (2-3x speedup vs DeepLabV3+) with fewer parameters (27M vs 65M) while maintaining comparable mIoU on ADE20K, making it ideal for mobile/edge deployment where DeepLab variants are too heavy.

17

yolos-tiny | Model | 41/100

via “vision transformer-based object detection with attention-weighted region proposals”

object-detection model. 83,525 downloads.

Unique: Applies a pure transformer architecture (an encoder-only ViT with learnable detection tokens, trained with a DETR-style set prediction loss) to object detection instead of CNN backbones, enabling attention-based spatial reasoning without region proposal networks; the tiny variant has roughly 6.5M parameters by building on a DeiT-tiny-sized backbone rather than by compressing a larger model, while maintaining COCO detection capability.

vs others: Simpler architecture than Faster R-CNN (no RPN) and more parameter-efficient than standard ViT detectors, but slower inference than optimized YOLO v5/v8 on edge devices due to transformer computational overhead

18

detr-resnet-101 | Model | 41/100

via “end-to-end transformer-based object detection with resnet-101 backbone”

object-detection model. 63,737 downloads.

Unique: Uses transformer encoder-decoder with bipartite matching loss instead of anchor-based region proposals or sliding windows, eliminating hand-crafted NMS and enabling direct set prediction of objects as a sequence-to-sequence problem

vs others: Simpler pipeline than Faster R-CNN (no RPN, no NMS) and more interpretable than YOLO, but slower inference due to transformer quadratic complexity compared to single-stage detectors

19

rtdetr_r101vd_coco_o365 | Model | 40/100

via “real-time object detection with transformer-based architecture”

object-detection model. 121,720 downloads.

Unique: Uses transformer encoder-decoder architecture with direct set prediction (eliminating anchor boxes and NMS) combined with ResNet-101-VD backbone, achieving real-time performance through efficient attention mechanisms and a hybrid CNN-transformer design that balances speed and accuracy; as the "coco_o365" suffix suggests, the checkpoint is pre-trained on Objects365 (365 categories) and fine-tuned on COCO (80 categories).

vs others: Faster than traditional Faster R-CNN/Mask R-CNN detectors (50-100ms vs 200-400ms) while maintaining higher accuracy than lightweight YOLO variants through transformer attention, and more practical for production than ViT-based detectors due to optimized backbone selection
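Latency figures such as these are easier to judge once converted to frames per second and compared against a frame budget. The sketch below uses the latencies quoted in this entry; the 30 FPS target is an assumed definition of "real-time" for illustration, and actual throughput depends on hardware, batch size, and input resolution.

```python
def fps_from_latency_ms(latency_ms: float) -> float:
    """Frames per second achievable when each frame takes latency_ms to process."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return 1000.0 / latency_ms


def meets_frame_budget(latency_ms: float, target_fps: float = 30.0) -> bool:
    """True if per-frame latency fits within the time budget of the target frame rate."""
    return latency_ms <= 1000.0 / target_fps


for latency in (50, 100, 200, 400):
    print(latency, fps_from_latency_ms(latency), meets_frame_budget(latency))
# 50 20.0 False
# 100 10.0 False
# 200 5.0 False
# 400 2.5 False
```

Under the assumed 30 FPS target the per-frame budget is about 33 ms, so a 50-100 ms detector runs at 10-20 FPS: several times faster than a 200-400 ms one, but suited to applications that tolerate a lower frame rate.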

20

Anzhcs_YOLOs | Model | 40/100

via “real-time multi-class object detection with bounding box localization”

object-detection model. 86,897 downloads.

Unique: Fine-tuned variant of Ultralytics YOLO11 base model specialized for art-domain object detection, inheriting YOLO11's architectural improvements (anchor-free detection, decoupled head design) while maintaining single-stage detection efficiency. Uses Ultralytics' native PyTorch implementation with built-in export support for ONNX, TensorRT, and CoreML for cross-platform deployment.

vs others: Faster inference than Faster R-CNN or Mask R-CNN (single-stage vs two-stage detection) with better art-domain accuracy than generic COCO-trained YOLOv8 due to fine-tuning on specialized data; lighter than Vision Transformers while maintaining competitive accuracy.
