Visual Genome
Dataset (free)
108K images with dense scene graphs and 5.4M region descriptions.
Capabilities (8 decomposed)
scene-graph-based visual relationship annotation
Medium confidence
Provides structured scene graph representations in which objects are nodes and relationships are directed edges encoding spatial and semantic connections between object instances. Each scene graph maps object instances to attributes and relationships using a (subject, predicate, object) triple format, enabling models to learn not just object detection but a compositional understanding of how objects interact and relate within an image. Scene graphs are grounded to WordNet synsets for semantic consistency across the dataset.
Uses directed scene graphs with WordNet synset grounding as the primary organizational mechanism, enabling semantic alignment across datasets and compositional reasoning about object interactions. This graph-based approach differs from flat object detection datasets by explicitly modeling relationships as first-class entities with their own vocabulary.
Captures explicit relationship semantics that flat object detection datasets (COCO, ImageNet) cannot represent, enabling training of relationship prediction models that understand not just which objects exist but how they spatially and semantically relate to each other.
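A minimal sketch of how a single scene graph edge could be represented in code; the class names, field names, ids, and example values are illustrative assumptions, not the dataset's exact JSON schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SceneObject:
    object_id: int
    name: str               # e.g. "man"
    synsets: List[str]      # WordNet grounding, e.g. ["man.n.01"]
    box: Tuple[int, int, int, int]  # (x, y, w, h) bounding box

@dataclass
class Relationship:
    subject: SceneObject
    predicate: str          # e.g. "riding"
    obj: SceneObject        # target node; named obj to avoid shadowing the builtin

# One directed edge of a scene graph as a (subject, predicate, object) triple
man = SceneObject(1058549, "man", ["man.n.01"], (421, 91, 79, 339))
horse = SceneObject(1058534, "horse", ["horse.n.01"], (178, 208, 398, 277))
edge = Relationship(man, "riding", horse)
print((edge.subject.name, edge.predicate, edge.obj.name))  # ('man', 'riding', 'horse')
```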
region-level dense visual description annotation
Medium confidence
Provides 5.4 million natural language descriptions of image regions, where each region is grounded to a bounding box and described in free-form text. This enables training of vision-language models that can generate or understand fine-grained descriptions of specific image areas rather than just whole-image captions. Descriptions are collected through crowdsourcing and provide diverse linguistic expressions for the same visual content.
Provides 5.4M region-level descriptions grounded to bounding boxes, enabling fine-grained vision-language alignment at the region level rather than image level. This dense annotation approach allows models to learn the relationship between specific image regions and their linguistic descriptions.
Offers region-level description density that exceeds COCO Captions (which provides 5 whole-image captions per image) by providing multiple descriptions per region, enabling training of models that understand fine-grained visual-linguistic correspondence.
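A minimal sketch of pairing a region description with its cropped image region using Pillow; the record keys, file paths, and values are illustrative assumptions rather than the dataset's exact layout.

```python
from PIL import Image

# Hypothetical region description record: free-form text grounded to a box
region = {
    "image_id": 42,
    "phrase": "a brown horse grazing in a field",
    "x": 178, "y": 208, "width": 398, "height": 277,
}

# Pair the phrase with its cropped region for region-level training or probing
image = Image.open("VG_100K/42.jpg")  # directory and naming are assumptions
left, top = region["x"], region["y"]
right, bottom = left + region["width"], top + region["height"]
crop = image.crop((left, top, right, bottom))
training_pair = (crop, region["phrase"])
```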
object-instance localization and attribute assignment
Medium confidence
Provides 3.8 million object instances with precise bounding box localization and 2.8 million attribute assignments that tag visual properties of those objects. Each object instance is localized with a bounding box and assigned multiple attributes (e.g., color, size, material, state) from a controlled vocabulary. Attributes are grounded to WordNet synsets, enabling semantic consistency and cross-dataset alignment of attribute meanings.
Combines 3.8M object instances with 2.8M attribute assignments grounded to WordNet synsets, providing semantic consistency for attribute meanings across the dataset. This enables training models that understand not just object categories but their visual properties as semantic concepts.
Provides richer attribute annotations than COCO (which has minimal attribute data) and grounds attributes to WordNet for semantic alignment, enabling attribute prediction models that generalize across datasets through shared semantic representations.
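A minimal sketch of an attribute-tagged object instance; the keys and values are illustrative assumptions, not the dataset's exact schema.

```python
# Hypothetical attribute-annotated object instance; keys are illustrative only
object_instance = {
    "object_id": 1058534,
    "names": ["horse"],
    "synsets": ["horse.n.01"],            # WordNet grounding
    "attributes": ["brown", "standing"],  # visual properties of this instance
    "x": 178, "y": 208, "w": 398, "h": 277,
}

def has_attribute(obj: dict, attribute: str) -> bool:
    """Return True if the annotated object carries the given attribute."""
    return attribute in obj.get("attributes", [])

assert has_attribute(object_instance, "brown")
```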
visual question-answering pair collection
Medium confidence
Provides 1.7 million visual question-answer pairs where questions are grounded in specific images and answers are derived from the image content and scene graph annotations. QA pairs cover diverse question types (object presence, counting, spatial relationships, attributes, relationships) and are collected through crowdsourcing. Questions are linked to specific regions or objects in the image, enabling training of visually-grounded QA systems.
Provides 1.7M QA pairs grounded in images with scene graph annotations, enabling training of VQA systems that can leverage structured relationship information to answer questions about object interactions and spatial configurations. Questions are linked to specific image regions, enabling region-grounded reasoning.
Offers larger scale and richer grounding than earlier VQA datasets (VQA v1/v2) by integrating QA pairs with scene graph annotations, enabling training of models that can perform structured reasoning about relationships and attributes.
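A minimal sketch of a grounded QA record and a naive exact-match scorer; the keys, ids, and values are illustrative assumptions.

```python
# Hypothetical grounded QA record; keys and ids are illustrative only
qa_pair = {
    "qa_id": 986768,
    "image_id": 42,
    "question": "What is the man riding?",
    "answer": "A horse.",
    "qa_objects": [1058549, 1058534],  # object ids the pair is grounded to
}

def exact_match(predicted: str, reference: str) -> bool:
    """Naive normalized exact-match scoring for a predicted answer."""
    def norm(s: str) -> str:
        return s.lower().strip().rstrip(".")
    return norm(predicted) == norm(reference)

print(exact_match("a horse", qa_pair["answer"]))  # True
```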
WordNet-grounded semantic concept alignment
Medium confidence
All annotated concepts (objects, attributes, relationships) are mapped to WordNet synsets, providing semantic grounding that enables cross-dataset alignment and generalization. This mapping allows models trained on Visual Genome to leverage semantic relationships defined in WordNet (hypernymy, meronymy, synonymy) and to transfer knowledge to other WordNet-aligned datasets. Synset mapping provides a shared semantic vocabulary across different annotation types.
Provides systematic WordNet synset grounding for all annotated concepts (objects, attributes, relationships), enabling semantic alignment across datasets and leveraging WordNet's rich semantic relationships for generalization. This grounding approach differs from datasets that use flat label vocabularies without semantic structure.
Enables transfer learning and zero-shot generalization through WordNet semantic relationships in ways that flat-vocabulary datasets (COCO, ImageNet) cannot support, allowing models to leverage hypernymy and other semantic relations for improved generalization.
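A minimal sketch of how synset grounding can be exploited with NLTK's WordNet interface: walking the hypernym chain of a grounded concept, something a flat label vocabulary cannot express. The synset name mirrors the grounding convention described above; the traversal itself is standard NLTK usage.

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)  # one-time corpus download

# An object grounded to "horse.n.01" inherits WordNet's hypernym chain
synset = wn.synset("horse.n.01")
chain = []
current = synset
while current.hypernyms():
    current = current.hypernyms()[0]
    chain.append(current.name())

print(chain)
# e.g. ['equine.n.01', 'odd-toed_ungulate.n.01', ..., 'entity.n.01']
```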
large-scale crowdsourced annotation collection and curation
Medium confidence
Manages collection and curation of 108,077 images with 5.4M region descriptions, 3.8M object instances, 2.8M attributes, 2.3M relationships, and 1.7M QA pairs through crowdsourcing workflows. The dataset represents a coordinated annotation effort across multiple annotation types, requiring quality control mechanisms, worker management, and inter-annotator agreement monitoring. Annotations are collected through structured crowdsourcing tasks with guidelines and validation procedures.
Coordinates collection of 5.4M region descriptions, 3.8M object instances, 2.8M attributes, 2.3M relationships, and 1.7M QA pairs across 108,077 images through integrated crowdsourcing workflows. This multi-type annotation coordination differs from single-task annotation datasets by requiring synchronized quality control across diverse annotation types.
Demonstrates feasibility of collecting multiple complementary annotation types (descriptions, objects, attributes, relationships, QA) at scale through coordinated crowdsourcing, whereas most datasets focus on single annotation types (COCO for captions, ImageNet for classification).
multi-modal visual-linguistic dataset for vision-language model training
Medium confidence
Provides integrated visual and linguistic data across 108,077 images with 5.4M region descriptions, 1.7M QA pairs, and structured scene graphs, enabling training of vision-language models that understand both visual content and natural language descriptions. The dataset supports multiple vision-language tasks (image captioning, visual grounding, VQA, relationship prediction) within a single coherent annotation framework. Linguistic descriptions are grounded to specific image regions and objects, enabling fine-grained visual-linguistic alignment.
Integrates region-level descriptions, scene graphs, and QA pairs within a single annotation framework, enabling vision-language models to learn fine-grained visual-linguistic alignment grounded to specific image regions and object relationships. This integrated approach differs from datasets that provide only whole-image captions or isolated QA pairs.
Provides richer multimodal grounding than COCO Captions (5 whole-image captions per image) through 5.4M region descriptions and scene graph relationships, enabling training of vision-language models that understand fine-grained visual-linguistic correspondence and object interactions.
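A minimal sketch of a PyTorch dataset that yields (region crop, description) pairs for vision-language training; the annotation file layout and record keys are assumptions, and a real loader would follow the dataset's actual JSON format.

```python
import json
from PIL import Image
from torch.utils.data import Dataset

class RegionCaptionDataset(Dataset):
    """Yields (region crop, description) pairs for vision-language training.

    Assumes region annotations are stored as a JSON list of records with
    image_id, phrase, and box fields; the real file layout may differ.
    """

    def __init__(self, annotation_path: str, image_dir: str):
        with open(annotation_path) as f:
            self.records = json.load(f)
        self.image_dir = image_dir

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, idx: int):
        r = self.records[idx]
        image = Image.open(f"{self.image_dir}/{r['image_id']}.jpg").convert("RGB")
        crop = image.crop((r["x"], r["y"],
                           r["x"] + r["width"], r["y"] + r["height"]))
        return crop, r["phrase"]
```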
structured scene understanding benchmark for visual reasoning
Medium confidence
Provides a comprehensive benchmark for evaluating visual reasoning systems through scene graphs, relationship prediction, attribute inference, and visual question-answering tasks. The dataset enables evaluation of models' ability to understand not just individual objects but their spatial and semantic relationships, compositional properties, and interactions. Scene graphs provide a structured representation for evaluating reasoning accuracy beyond object detection metrics.
Provides structured scene graph annotations that enable evaluation of visual reasoning beyond object detection, allowing assessment of models' ability to predict relationships, attributes, and answer complex questions about object interactions. This structured evaluation approach differs from image classification benchmarks.
Enables evaluation of relationship prediction and scene understanding that object detection benchmarks (COCO, ImageNet) cannot support, providing structured ground truth for assessing compositional visual reasoning capabilities.
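A minimal sketch of a Recall@K-style check for predicted relationship triples, the kind of metric commonly used for scene graph prediction; matching rules are simplified (no box overlap thresholds), and the function name and example triples are illustrative.

```python
def recall_at_k(predicted_triples, gold_triples, k=50):
    """Fraction of ground-truth (subject, predicate, object) triples that
    appear among the top-k model predictions for one image."""
    top_k = set(predicted_triples[:k])  # assumed already sorted by model score
    gold = set(gold_triples)
    if not gold:
        return 0.0
    return len(gold & top_k) / len(gold)

gold = [("man", "riding", "horse"), ("horse", "on", "grass")]
predicted = [("man", "riding", "horse"), ("man", "wearing", "hat")]
print(recall_at_k(predicted, gold, k=50))  # 0.5
```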
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with Visual Genome, ranked by overlap. Discovered automatically through the match graph.
Qwen: Qwen3 VL 30B A3B Thinking
Qwen3-VL-30B-A3B-Thinking is a multimodal model that unifies strong text generation with visual understanding for images and videos. Its Thinking variant enhances reasoning in STEM, math, and complex tasks. It excels...
Florence-2
Microsoft's unified model for diverse vision tasks.
V7
AI Data Engine for Computer Vision & Generative...
Qwen: Qwen3 VL 8B Thinking
Qwen3-VL-8B-Thinking is the reasoning-optimized variant of the Qwen3-VL-8B multimodal model, designed for advanced visual and textual reasoning across complex scenes, documents, and temporal sequences. It integrates enhanced multimodal alignment and...
Qwen: Qwen3 VL 8B Instruct
Qwen3-VL-8B-Instruct is a multimodal vision-language model from the Qwen3-VL series, built for high-fidelity understanding and reasoning across text, images, and video. It features improved multimodal fusion with Interleaved-MRoPE for long-horizon...
xAI: Grok 4
Grok 4 is xAI's latest reasoning model with a 256k context window. It supports parallel tool calling, structured outputs, and both image and text inputs. Note that reasoning is not...
Best For
- ✓ Computer vision researchers building scene understanding systems
- ✓ ML practitioners training visual relationship detection models
- ✓ Teams developing visual reasoning and VQA systems
- ✓ Researchers studying compositional visual understanding
- ✓ Researchers building dense image captioning and region description models
- ✓ Teams developing vision-language alignment systems
- ✓ ML practitioners training visual grounding models
- ✓ Researchers studying fine-grained visual understanding
Known Limitations
- ⚠ Scene graphs limited to pairwise object-relationship-object triples; no higher-order n-ary relationships documented
- ⚠ Relationship vocabulary scope unknown; the number of distinct relationship types is not specified in documentation
- ⚠ No inter-annotator agreement metrics or annotation error rates provided for relationship quality assessment
- ⚠ Fixed dataset size of 108,077 images with unknown annotation coverage percentage per image
- ⚠ Region descriptions are crowdsourced with unknown inter-annotator agreement or quality metrics
- ⚠ No documentation of description length distribution, vocabulary complexity, or linguistic diversity
Requirements
Input / Output
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Dense visual knowledge dataset containing 108,077 images with 5.4 million region descriptions, 1.7 million visual QA pairs, 3.8 million object instances, 2.8 million attributes, and 2.3 million relationships between objects. Each image is annotated with scene graphs connecting objects through spatial and semantic relationships. Critical for training models that understand not just what objects are in an image but how they relate to each other. Foundational for visual reasoning and scene understanding research.
Categories
Alternatives to Visual Genome
The GitHub for AI — 500K+ models, datasets, Spaces, Inference API, hub for open-source AI.
Data Sources