Instruction Guided Editing With Text Based Spatial Control

1

Awesome-Video-Diffusion-Models · Repository · 42/100

via “text-guided-video-editing-method-catalog”

[CSUR] A Survey on Video Diffusion Models

Unique: Explicitly separates text-guided video editing from text-to-video generation, recognizing that editing existing video content requires different architectural approaches (e.g., preserving unedited regions, maintaining temporal consistency across edits) than generating video from scratch. This distinction helps practitioners understand which methods apply to their use case.

vs others: More focused than a generic 'video diffusion' categorization; organizes editing-specific methods explicitly rather than requiring practitioners to filter through generation approaches.

2

BrushNet · Model · 35/100

via “instruction-guided editing with text-based spatial control”

[ECCV 2024] The official implementation of the paper "BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion"

Unique: Combines text-guided inpainting with instruction parsing and spatial reasoning to enable high-level editing commands without manual mask drawing, using auxiliary models for object detection/segmentation to convert natural language into spatial masks.

vs others: More user-friendly than manual mask drawing while maintaining precise control through text instructions; leverages BrushNet's text-guided capabilities with automated mask generation, unlike simple inpainting tools that require manual mask creation.
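
The instruction-to-mask step described above can be sketched roughly as follows. This is a minimal illustration, not BrushNet's actual code: the `detections` dictionary stands in for the output of a hypothetical object detector, and `instruction_to_mask` and `box_to_mask` are names invented for this example. The resulting binary mask is what would then be handed to a text-guided inpainting model.

```python
from typing import Dict, List, Tuple

# (x0, y0, x1, y1) in pixels; x1 and y1 are exclusive. Assumed convention.
Box = Tuple[int, int, int, int]


def box_to_mask(box: Box, width: int, height: int) -> List[List[int]]:
    """Rasterize a bounding box into a binary mask (1 = region to edit)."""
    x0, y0, x1, y1 = box
    # Clamp the box to the image bounds.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]


def instruction_to_mask(
    instruction: str, detections: Dict[str, Box], width: int, height: int
) -> List[List[int]]:
    """Mask the first detected object whose label appears in the instruction."""
    text = instruction.lower()
    for label, box in detections.items():
        if label.lower() in text:
            return box_to_mask(box, width, height)
    raise LookupError("no detected object is mentioned in the instruction")


# Hypothetical detector output for a 4x4 image.
detections = {"dog": (1, 1, 3, 3), "tree": (0, 0, 1, 4)}
mask = instruction_to_mask("Replace the dog with a cat", detections, width=4, height=4)
```

A real pipeline would likely use a segmentation model for tighter masks and a language model for parsing the instruction; simple substring matching is used here only to keep the sketch self-contained.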

3

PromptEnhancer · Prompt · 35/100

via “vision-language image-to-image editing instruction refinement”

[CVPR 2026] PromptEnhancer is a prompt-rewriting tool, refining prompts into clearer, structured versions for better image generation.

Unique: Implements multi-modal chain-of-thought reasoning that jointly analyzes image content and editing instructions, grounding the instruction refinement in actual visual elements rather than processing text in isolation. This enables spatial awareness and visual context integration that text-only prompt enhancement cannot achieve.

vs others: Produces more spatially-aware and visually-grounded editing instructions than text-only prompt enhancement because it analyzes the actual image content, reducing ambiguity and improving downstream image-to-image model performance on complex edits.

4

Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning (CM3Leon) · Product · 25/100

via “language-guided image editing with instruction following”

* ⏫ 07/2023: [Meta-Transformer: A Unified Framework for Multimodal Learning (Meta-Transformer)](https://arxiv.org/abs/2307.10802)

Unique: Performs language-guided editing within the unified decoder by conditioning on both image and text tokens, enabling instruction-based editing without separate mask inputs or specialized editing architectures.

vs others: More intuitive than mask-based editing because it uses natural language instructions; more flexible than ControlNet because it doesn't require precise spatial control inputs.

5

Imagic: Text-Based Real Image Editing with Diffusion Models (Imagic) · Product · 18/100

via “text-guided iterative image editing via embedding interpolation”

* ⭐ 11/2022: [Visual Prompt Tuning](https://link.springer.com/chapter/10.1007/978-3-031-19827-4_41)

Unique: Uses embedding-space interpolation rather than pixel-space blending or mask-based compositing, enabling semantic edits that respect the diffusion model's learned feature space. The edit strength parameter provides intuitive control over edit magnitude without requiring architectural changes or per-edit retraining.

vs others: Produces more semantically coherent edits than naive text-to-image generation because it preserves the original image structure through the inversion and interpolation process, while offering more control than simple blending-based approaches.
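
The embedding-space interpolation mentioned above can be illustrated with a short sketch. It assumes a simple linear blend between an embedding fitted to the source image (`e_opt`) and the embedding of the target text (`e_tgt`), controlled by an edit strength `eta`; the function name and the toy vectors are hypothetical, and the diffusion model that would consume the blended embedding is left out.

```python
from typing import List


def interpolate_embeddings(
    e_opt: List[float], e_tgt: List[float], eta: float
) -> List[float]:
    """Linearly blend two embeddings of equal length.

    eta = 0 returns e_opt (stays close to the input image);
    eta = 1 returns e_tgt (full edit strength).
    """
    if len(e_opt) != len(e_tgt):
        raise ValueError("embeddings must have the same length")
    return [eta * t + (1.0 - eta) * o for o, t in zip(e_opt, e_tgt)]


e_opt = [0.0, 2.0, -1.0]  # hypothetical embedding fitted to the source image
e_tgt = [1.0, 0.0, 1.0]   # hypothetical embedding of the target text
blended = interpolate_embeddings(e_opt, e_tgt, eta=0.5)
```

Sweeping `eta` between 0 and 1 is what gives a single scalar control over edit magnitude: small values favor fidelity to the original image, larger values favor the target text.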
