Visual Genome
Dataset · Free
108K images with dense scene graphs and 5.4M region descriptions.
Capabilities (8 decomposed)
scene-graph-based visual relationship extraction
Medium confidence
Extracts and structures semantic relationships between objects in images using scene graph representations where nodes are objects and edges encode spatial/semantic relationships (e.g., 'person sitting on bench', 'cup on table'). The dataset provides pre-annotated scene graphs for 108K images, enabling models to learn structured reasoning about object interactions rather than treating images as flat feature vectors. Each relationship is labeled with predicate types (spatial: 'on', 'under'; semantic: 'wearing', 'holding') and grounded to pixel coordinates.
Provides densely annotated scene graphs at scale (2.3M relationships across 108K images) with explicit predicate types and pixel-level grounding, enabling structured learning of visual relationships rather than implicit feature-based representations. Uses hierarchical annotation combining object-level, attribute-level, and relationship-level labels in a unified graph structure.
Richer than COCO (object detection only) and more structured than ImageNet (no relationship annotations); enables training models that reason about object interactions, not just recognition
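The node-and-edge structure described above can be sketched in a few lines. The dict layout here is illustrative only (Visual Genome's released JSON uses its own field names), but the idea is the same: objects are nodes grounded to boxes, relationships are predicate-labeled edges between them.

```python
# Illustrative scene-graph structure: objects are nodes with (x, y, w, h)
# bounding boxes; relationships are predicate-labeled edges between object ids.
scene_graph = {
    "objects": [
        {"id": 0, "name": "person", "bbox": [50, 80, 120, 260]},
        {"id": 1, "name": "bench",  "bbox": [30, 250, 300, 90]},
        {"id": 2, "name": "cup",    "bbox": [200, 60, 40, 50]},
    ],
    "relationships": [
        {"subject": 0, "predicate": "sitting on", "object": 1},
        {"subject": 2, "predicate": "on",         "object": 1},
    ],
}

def triples(graph):
    """Render each edge as a readable (subject, predicate, object) triple."""
    names = {o["id"]: o["name"] for o in graph["objects"]}
    return [(names[r["subject"]], r["predicate"], names[r["object"]])
            for r in graph["relationships"]]

print(triples(scene_graph))
# [('person', 'sitting on', 'bench'), ('cup', 'on', 'bench')]
```

A model trained on such triples learns explicit object interactions rather than an undifferentiated image embedding.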
dense-region-description-grounding
Medium confidence
Provides 5.4 million natural language descriptions grounded to specific image regions (bounding boxes), enabling training of vision-language models that map text to visual regions. Each region description is manually written by annotators and linked to pixel coordinates, creating a dense supervision signal for learning region-text alignment. Descriptions range from simple object names to complex compositional descriptions capturing attributes, actions, and relationships.
Provides 5.4M region descriptions with pixel-level grounding across 108K images, creating dense supervision for learning fine-grained region-text alignment. Uses multi-annotator consensus for quality control and covers diverse object categories, attributes, and compositional descriptions.
Denser and more diverse than Flickr30K (158K descriptions) and provides explicit region coordinates unlike raw image-caption pairs; enables training region-grounding models at scale
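Because each description carries a box, region-grounding models are typically scored by box overlap (IoU) against these annotations. A minimal sketch, with hypothetical records and a hypothetical `best_region` helper:

```python
# Illustrative region-description records: each phrase is grounded to a
# bounding box in (x, y, w, h) pixel coordinates.
regions = [
    {"phrase": "a red cup on the table", "bbox": [200, 60, 40, 50]},
    {"phrase": "a wooden bench",         "bbox": [30, 250, 300, 90]},
]

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def best_region(pred_box, regions):
    """Match a predicted box to the grounded phrase with the highest IoU."""
    return max(regions, key=lambda r: iou(pred_box, r["bbox"]))["phrase"]

print(best_region([195, 55, 45, 55], regions))  # 'a red cup on the table'
```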
visual-question-answering-dataset-with-scene-context
Medium confidence
Contains 1.7 million visual question-answer pairs grounded in scene context, where questions reference objects, relationships, and attributes visible in images. Questions are paired with images and scene graphs, enabling models to learn to answer questions by reasoning over visual structure rather than pattern-matching. Answer types range from simple object names to complex compositional answers requiring multi-step reasoning over relationships.
Integrates 1.7M QA pairs with scene graph annotations, enabling models to learn reasoning over structured visual knowledge rather than image-level features alone. Questions are grounded in specific objects and relationships, creating a tighter coupling between language and visual structure.
Larger and more structured than VQA v2 (1.1M questions) and includes scene graph grounding unlike standard VQA datasets; enables training models that reason over visual relationships
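The coupling between questions and scene graphs means some answers can be derived by graph lookup rather than pixel-level pattern matching. A toy sketch (the graph and the `answer_what_is` helper are illustrative, not part of the dataset's tooling):

```python
# Toy scene graph: object id -> name, plus (subject, predicate, object) edges.
scene_graph = {
    "objects": {0: "person", 1: "bench", 2: "dog"},
    "relationships": [
        (0, "sitting on", 1),
        (2, "next to", 1),
    ],
}

def answer_what_is(subject, predicate, graph):
    """Answer 'what is <subject> <predicate>?' by scene-graph lookup."""
    for s, p, o in graph["relationships"]:
        if graph["objects"][s] == subject and p == predicate:
            return graph["objects"][o]
    return None  # question not answerable from this graph

print(answer_what_is("person", "sitting on", scene_graph))  # bench
```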
object-instance-detection-with-dense-attributes
Medium confidence
Provides 3.8 million annotated object instances with bounding boxes, class labels, and 2.8 million attribute annotations (e.g., color, material, size, state). Each object is labeled with multiple attributes describing its visual properties, enabling training of models that predict not just object categories but fine-grained visual properties. Attributes are structured as key-value pairs (e.g., 'color: red', 'material: wood') and grounded to specific object instances.
Combines 3.8M object instances with 2.8M attribute annotations in a unified dataset, enabling training of attribute-aware detection models. Attributes are structured as key-value pairs and grounded to specific instances, creating dense supervision for learning visual properties beyond category labels.
Richer attribute annotations than COCO (which has minimal attributes) and larger scale than fine-grained datasets like CUB-200 (11K images); enables training attribute-aware detection at scale
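The key-value attribute structure supports filtering instances by visual properties, which is how attribute-aware training sets are commonly assembled. A sketch with hypothetical records and a hypothetical `having` helper:

```python
# Illustrative instances: each object carries attribute key-value pairs.
objects = [
    {"name": "cup",   "attributes": {"color": "red",   "material": "ceramic"}},
    {"name": "table", "attributes": {"color": "brown", "material": "wood"}},
    {"name": "chair", "attributes": {"color": "red",   "material": "wood"}},
]

def having(objs, **attrs):
    """Select instances whose attributes match every given key-value pair."""
    return [o["name"] for o in objs
            if all(o["attributes"].get(k) == v for k, v in attrs.items())]

print(having(objects, color="red"))                   # ['cup', 'chair']
print(having(objects, color="red", material="wood"))  # ['chair']
```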
multimodal-dataset-integration-for-vision-language-models
Medium confidence
Integrates images, scene graphs, region descriptions, object attributes, and QA pairs into a unified multimodal dataset, enabling end-to-end training of vision-language models that learn from multiple supervision signals simultaneously. The dataset structure allows models to leverage complementary annotations (e.g., region descriptions for grounding, scene graphs for reasoning, attributes for fine-grained understanding) in a single training pipeline. Supports multi-task learning where models jointly optimize for detection, grounding, VQA, and relationship prediction.
Provides unified integration of 5 complementary annotation types (scene graphs, region descriptions, object instances, attributes, QA pairs) across 108K images, enabling multi-task learning from diverse supervision signals. Dataset structure supports joint optimization for detection, grounding, reasoning, and attribute prediction in a single training pipeline.
More comprehensive than single-task datasets (COCO, Flickr30K) and enables multi-task learning unlike datasets with isolated annotation types; supports training unified models that leverage complementary supervision signals
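In practice the annotation types are keyed by image id, so a multi-task pipeline can merge them into one training record. A minimal sketch under that assumption (the field names are illustrative):

```python
# Illustrative per-image annotation stores, each keyed by image_id.
regions    = {1: ["a man on a bench"]}
qa_pairs   = {1: [("What is the man doing?", "sitting")]}
attributes = {1: [("bench", "material", "wood")]}

def unified_record(image_id):
    """Merge complementary supervision signals into one training example."""
    return {
        "image_id":   image_id,
        "regions":    regions.get(image_id, []),
        "qa":         qa_pairs.get(image_id, []),
        "attributes": attributes.get(image_id, []),
    }

rec = unified_record(1)
print(rec["qa"][0])  # ('What is the man doing?', 'sitting')
```

A multi-task loss can then draw each term (grounding, VQA, attribute prediction) from the same record.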
scene-graph-based-image-retrieval-and-indexing
Medium confidence
Enables indexing and retrieval of images based on scene graph structure and relationships, allowing queries like 'find images with a person sitting on a bench' or 'images where a dog is next to a car'. Scene graphs are indexed as structured knowledge representations, supporting semantic search over visual relationships rather than keyword matching. Retrieval can be performed by querying for specific objects, relationships, or relationship patterns.
Provides 2.3M annotated relationships indexed as scene graphs, enabling structured retrieval by visual relationships and spatial configurations. Supports querying by relationship patterns (e.g., 'X on Y') rather than keyword matching, enabling semantic search over visual structure.
Enables relationship-based retrieval unlike keyword-based image search; supports complex spatial/semantic queries that text-based systems cannot express
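A pattern query like 'X on Y' amounts to matching triples with wildcards over an index of scene graphs. A sketch, with a hypothetical `retrieve` helper and toy index:

```python
# Toy index: image id -> list of (subject, predicate, object) triples.
index = {
    "img_1": [("person", "sitting on", "bench"), ("dog", "next to", "car")],
    "img_2": [("cup", "on", "table")],
    "img_3": [("person", "sitting on", "chair")],
}

def retrieve(index, subject=None, predicate=None, obj=None):
    """Return image ids matching the pattern; None acts as a wildcard."""
    def matches(t):
        s, p, o = t
        return ((subject is None or s == subject) and
                (predicate is None or p == predicate) and
                (obj is None or o == obj))
    return sorted(img for img, edges in index.items()
                  if any(matches(t) for t in edges))

print(retrieve(index, subject="person", predicate="sitting on"))
# ['img_1', 'img_3']
```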
visual-relationship-distribution-analysis-and-statistics
Medium confidence
Provides statistical analysis and distribution information about visual relationships, objects, and attributes across the dataset, enabling researchers to understand frequency patterns, co-occurrence statistics, and relationship distributions. Includes statistics on predicate frequencies, object co-occurrence patterns, attribute distributions, and relationship types. Enables analysis of visual knowledge biases and patterns in the dataset.
Provides comprehensive statistical analysis of 2.3M relationships, 3.8M objects, and 2.8M attributes across 108K images, enabling researchers to understand visual knowledge distributions and dataset biases. Includes frequency statistics, co-occurrence patterns, and relationship type distributions.
Enables large-scale statistical analysis of visual relationships unlike smaller datasets; provides insights into relationship distributions and biases for improving model training
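Predicate frequencies and co-occurrence counts of this kind are straightforward to compute once relationships are available as triples; a sketch over a toy sample:

```python
from collections import Counter

# Toy sample of (subject, predicate, object) triples.
sample = [
    ("person", "sitting on", "bench"),
    ("cup", "on", "table"),
    ("book", "on", "table"),
    ("dog", "next to", "car"),
]

# Predicate frequency distribution (useful for spotting long-tail bias).
predicate_freq = Counter(p for _, p, _ in sample)

# Subject/object co-occurrence for a given predicate.
on_pairs = Counter((s, o) for s, p, o in sample if p == "on")

print(predicate_freq.most_common(1))  # [('on', 2)]
```

At full scale the same pass over 2.3M triples reveals the heavy skew toward a few spatial predicates that relationship-detection papers routinely report.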
compositional-visual-understanding-through-structured-annotations
Medium confidence
Enables training of compositional visual understanding models by providing structured annotations that decompose images into objects, attributes, and relationships. Models can learn to compose understanding from parts (objects + attributes + relationships) rather than treating images as monolithic wholes. Supports learning of compositional generalization where models understand novel combinations of known objects and relationships.
Provides explicit decomposition of images into objects, attributes, and relationships, enabling training of compositional models that understand visual scenes through structured components. Scene graphs naturally support compositional learning by representing images as compositions of objects and relationships.
Enables compositional learning unlike flat image-label datasets; supports training models that generalize to novel combinations of known components
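Compositional generalization is usually evaluated by testing on combinations whose parts were all seen in training but whose full triple was not. A sketch of that split logic (the helper name is illustrative):

```python
# Training triples: every component below is "seen" during training.
train = [("person", "riding", "horse"), ("person", "holding", "cup")]

def is_novel_composition(triple, seen):
    """True when every component was seen in training, but the full
    (subject, predicate, object) combination was not."""
    subjects   = {s for s, _, _ in seen}
    predicates = {p for _, p, _ in seen}
    objs       = {o for _, _, o in seen}
    s, p, o = triple
    return (triple not in seen and
            s in subjects and p in predicates and o in objs)

print(is_novel_composition(("person", "riding", "cup"), train))  # True
```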
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with Visual Genome, ranked by overlap. Discovered automatically through the match graph.
Arcee AI: Spotlight
Spotlight is a 7‑billion‑parameter vision‑language model derived from Qwen 2.5‑VL and fine‑tuned by Arcee AI for tight image‑text grounding tasks. It offers a 32 k‑token context window, enabling rich multimodal...
Qwen: Qwen3 VL 8B Instruct
Qwen3-VL-8B-Instruct is a multimodal vision-language model from the Qwen3-VL series, built for high-fidelity understanding and reasoning across text, images, and video. It features improved multimodal fusion with Interleaved-MRoPE for long-horizon...
Qwen: Qwen3 VL 30B A3B Instruct
Qwen3-VL-30B-A3B-Instruct is a multimodal model that unifies strong text generation with visual understanding for images and videos. Its Instruct variant optimizes instruction-following for general multimodal tasks. It excels in perception...
Qwen: Qwen3 VL 32B Instruct
Qwen3-VL-32B-Instruct is a large-scale multimodal vision-language model designed for high-precision understanding and reasoning across text, images, and video. With 32 billion parameters, it combines deep visual perception with advanced text...
Qwen: Qwen3 VL 8B Thinking
Qwen3-VL-8B-Thinking is the reasoning-optimized variant of the Qwen3-VL-8B multimodal model, designed for advanced visual and textual reasoning across complex scenes, documents, and temporal sequences. It integrates enhanced multimodal alignment and...
Mistral: Pixtral Large 2411
Pixtral Large is a 124B parameter, open-weight, multimodal model built on top of [Mistral Large 2](/mistralai/mistral-large-2411). The model is able to understand documents, charts and natural images. The model is...
Best For
- ✓Computer vision researchers building scene understanding models
- ✓Teams developing visual reasoning and VQA systems
- ✓ML engineers training multimodal models requiring structured visual knowledge
- ✓Researchers building vision-language foundation models (CLIP-style architectures)
- ✓Teams developing region-based visual understanding systems
- ✓ML engineers training dense visual grounding models
Known Limitations
- ⚠Scene graphs are manually annotated, introducing subjective bias in relationship definitions and predicate selection
- ⚠Predicate vocabulary is limited to ~100 relationship types and may not capture domain-specific relationships
- ⚠Annotation coverage is uneven — some images have dense relationship annotations while others are sparse
- ⚠Relationships are binary (between two objects) and do not capture n-ary relationships or complex spatial configurations
- ⚠Region descriptions are subjective and vary in length/detail across annotators, introducing inconsistency
- ⚠Regions are rectangular bounding boxes, not segmentation masks, so they cannot delineate non-rectangular object shapes
About
Dense visual knowledge dataset containing 108,077 images with 5.4 million region descriptions, 1.7 million visual QA pairs, 3.8 million object instances, 2.8 million attributes, and 2.3 million relationships between objects. Each image is annotated with scene graphs connecting objects through spatial and semantic relationships. Critical for training models that understand not just what objects are in an image but how they relate to each other. Foundational for visual reasoning and scene understanding research.