BrushNet

RepositoryFree

[ECCV 2024] The official implementation of paper "BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion"

Open Source

/ 100

12 capabilities

Capabilities12 decomposed

decomposed dual-branch diffusion inpainting with masked feature separation

Medium confidence

Implements a specialized dual-branch architecture that separates masked image features from noisy latent features during the diffusion process, reducing the model's learning load and enabling precise inpainting. The architecture processes segmentation or random masks through dedicated branches that converge at multiple resolution levels, allowing the base diffusion model to focus on content generation within masked regions while preserving unmasked areas. This decomposition is achieved through custom UNet modifications in the diffusers library that inject BrushNet control at intermediate layers without requiring full model retraining.

Solves for

I want to inpaint masked regions in images while preserving surrounding context using text guidanceI need to integrate inpainting capabilities into existing Stable Diffusion pipelines without retraining from scratchI want to support both object-shaped segmentation masks and arbitrary random masks in the same modelI need fine-grained per-pixel control over the diffusion process for high-quality inpainting results

Best for

Computer vision researchers implementing plug-and-play diffusion extensions

ML engineers building image editing applications on top of Stable Diffusion

Teams requiring production-grade inpainting without full model retraining

Requires

Python 3.9+

PyTorch 1.12.1+

CUDA 11.6+ (recommended for GPU inference)

Limitations

Requires pre-trained base diffusion model (SD 1.5 or SDXL) — cannot function standalone

Inference latency depends on base model's diffusion steps (typically 50-100 steps for quality results)

Memory footprint scales with image resolution; 4K+ images may require gradient checkpointing or reduced batch sizes

What makes it unique

Uses decomposed dual-branch architecture with dense per-pixel control injected at multiple UNet resolution levels, enabling plug-and-play integration without modifying base model weights. Unlike naive masking approaches, separates masked feature processing from latent noise processing, reducing learning burden and improving boundary quality.

vs alternatives

Achieves higher inpainting quality than simple mask-based approaches (e.g., Inpaint-LoRA) while maintaining compatibility with any pre-trained diffusion model, and requires significantly less training data than full model fine-tuning approaches.

text-guided inpainting pipeline with multi-variant model support

Medium confidence

Provides unified inference pipelines (StableDiffusionBrushNetPipeline and StableDiffusionXLBrushNetPipeline) that orchestrate the complete inpainting workflow: text encoding via CLIP/OpenCLIP, mask preprocessing, latent encoding of the original image, iterative diffusion with BrushNet control injection, and final decoding. The pipeline abstracts away the complexity of managing multiple model components (text encoder, VAE, UNet, scheduler) and provides a simple API while supporting both SD 1.5 and SDXL base models with separate segmentation and random mask variants.

Solves for

I want a simple, high-level API to perform text-guided inpainting without managing diffusion internalsI need to switch between SD 1.5 and SDXL base models without changing application codeI want to control inference parameters like guidance scale, number of steps, and random seedI need to batch process multiple images with different masks and prompts efficiently

Best for

Application developers building image editing UIs or APIs

Data scientists prototyping inpainting workflows

Teams deploying inpainting as a microservice

Requires

Python 3.9+

PyTorch 1.12.1+

Hugging Face transformers library (for CLIP text encoder)

Limitations

Pipeline initialization loads all model components into memory (~7GB for SD 1.5, ~13GB for SDXL) — requires GPU with sufficient VRAM

Sequential processing of batches; no built-in distributed inference across multiple GPUs

Text encoding is fixed to CLIP tokenizer (77 tokens for SD 1.5); longer prompts are truncated

What makes it unique

Provides unified pipeline abstraction that handles model variant selection (SD 1.5 vs SDXL, segmentation vs random mask) and component orchestration transparently, with built-in support for both guidance scale and negative prompts for fine-grained control over generation quality.

vs alternatives

Simpler API than raw diffusers pipeline usage while maintaining full control over inference parameters; supports both SD 1.5 and SDXL without code changes, unlike single-model implementations.

model weight quantization and optimization for deployment

Medium confidence

Provides tools for reducing model size and inference latency through quantization (INT8, FP16) and optimization techniques. The system supports post-training quantization of BrushNet weights, mixed-precision inference (FP16 for forward pass, FP32 for critical operations), and optional pruning of less important weights. Quantized models achieve 2-4x speedup with minimal quality loss, enabling deployment on resource-constrained devices (edge GPUs, mobile) or higher throughput on servers.

Solves for

I want to reduce model size for deployment on edge devices or mobileI need to increase inference throughput on servers with limited GPU memoryI want to minimize latency for real-time interactive applicationsI need to balance quality and performance for production deployments

Best for

ML engineers optimizing models for production deployment

Teams building edge AI applications with resource constraints

Organizations requiring high-throughput inference on limited hardware

Requires

PyTorch with quantization support (torch.quantization)

Optional: TensorRT or ONNX for advanced optimization

Calibration dataset for post-training quantization (typically 100-500 images)

Limitations

INT8 quantization can cause 2-5% quality degradation (LPIPS) compared to FP32; requires validation on target domain

Quantization is model-specific; quantized weights cannot be transferred between different base models (SD 1.5 vs SDXL)

FP16 mixed precision may cause numerical instability in some operations; requires careful testing

What makes it unique

Provides integrated quantization pipeline with quality validation and performance benchmarking, supporting multiple quantization strategies (INT8, FP16, dynamic) with automatic calibration and fallback mechanisms for numerical stability.

vs alternatives

Simpler than manual quantization using TensorRT or ONNX while maintaining quality validation; supports multiple quantization types with automatic selection based on target device, unlike single-strategy approaches.

integration with huggingface diffusers ecosystem

Medium confidence

Provides seamless integration with the HuggingFace diffusers library, enabling BrushNet to work with any diffusers-compatible scheduler, pipeline, and model. The integration includes custom BrushNet model classes (BrushNetModel) that inherit from diffusers base classes, custom pipeline classes (StableDiffusionBrushNetPipeline) that follow diffusers conventions, and compatibility with diffusers utilities (safety checker, feature extractor). This enables users to leverage the entire diffusers ecosystem (LoRA, ControlNet, other extensions) alongside BrushNet.

Solves for

I want to use BrushNet with different schedulers (DDIM, Euler, DPM++) without code changesI need to combine BrushNet with other diffusers extensions (LoRA, ControlNet, safety checkers)I want to load/save BrushNet models using standard diffusers APIsI need to integrate BrushNet into existing diffusers-based applications

Best for

Developers already using HuggingFace diffusers in their projects

Teams building modular diffusion pipelines with multiple extensions

Researchers combining BrushNet with other diffusion techniques

Requires

HuggingFace diffusers library (0.21.0+)

PyTorch 1.12.1+

Transformers library for CLIP text encoder

Limitations

Integration is limited to diffusers API conventions; custom diffusion implementations may not be compatible

Some diffusers features (e.g., IP-Adapter, certain LoRA variants) may not work seamlessly with BrushNet without additional integration work

Scheduler compatibility depends on diffusers version; older versions may lack support for newer schedulers

What makes it unique

Implements BrushNet as native diffusers components (BrushNetModel, custom pipelines) following diffusers conventions, enabling seamless composition with other diffusers extensions and schedulers without wrapper layers or compatibility shims.

vs alternatives

Tighter integration than wrapper-based approaches; BrushNet components inherit from diffusers base classes, enabling direct use of diffusers utilities and compatibility with the broader ecosystem, unlike standalone implementations.

mask-aware latent encoding and feature extraction

Medium confidence

Preprocesses input images and masks into latent space representations that preserve spatial information about masked vs unmasked regions. The system encodes the original image through the VAE encoder, then applies mask-aware feature extraction that separates masked image features from the noisy latent representation. This preprocessing step is critical for the dual-branch architecture, as it ensures the BrushNet model receives properly formatted input that distinguishes between regions to inpaint and regions to preserve, using spatial masking operations at the latent level (typically 8x downsampled from image space).

Solves for

I need to convert high-resolution images and masks into latent representations for efficient diffusion processingI want to ensure masked regions are properly isolated in latent space before diffusion beginsI need to handle variable image sizes and aspect ratios while maintaining mask alignmentI want to extract features from both masked and unmasked regions for the dual-branch architecture

Best for

ML engineers optimizing inference performance by working in latent space

Researchers studying diffusion model behavior in latent representations

Teams building custom inpainting pipelines with specialized preprocessing needs

Requires

Pre-trained VAE encoder (included in SD 1.5 or SDXL model)

PyTorch with CUDA support for efficient tensor operations

Input image dimensions must be multiples of 8 (e.g., 512x512, 768x768)

Limitations

VAE encoding introduces quantization artifacts due to 8x spatial downsampling — fine details in masks may be lost

Mask must be resized to match latent dimensions (typically 64x64 for 512x512 images); interpolation can introduce boundary artifacts

Latent space representation is model-specific (SD 1.5 VAE differs from SDXL VAE) — cannot transfer latents between models

What makes it unique

Implements mask-aware latent extraction that preserves spatial masking information through the VAE encoding process, using dual-branch feature separation at latent level rather than image level, enabling efficient per-pixel control without full image-resolution processing.

vs alternatives

More efficient than image-space masking because it operates on 8x downsampled latents, reducing memory and compute requirements while maintaining spatial precision through dedicated mask channels in the latent representation.

multi-resolution dense per-pixel control injection

Medium confidence

Injects BrushNet control signals at multiple UNet resolution levels (typically 4 scales: 64x64, 32x32, 16x16, 8x8) to provide fine-grained guidance over the diffusion process. The control mechanism works by modifying the UNet's cross-attention and self-attention layers with BrushNet-specific conditioning that incorporates mask information and masked image features at each resolution. This multi-scale injection ensures that both coarse structure (from low-resolution features) and fine details (from high-resolution features) are properly controlled, enabling precise inpainting without affecting unmasked regions.

Solves for

I want to guide diffusion at multiple scales to ensure both structural coherence and fine detail qualityI need to prevent the diffusion process from modifying unmasked regions while inpainting masked areasI want to inject spatial control without modifying the base UNet architectureI need to balance inpainting quality with computational efficiency across different image resolutions

Best for

Researchers studying multi-scale diffusion control mechanisms

ML engineers implementing custom diffusion guidance strategies

Teams requiring fine-grained control over inpainting quality and boundary preservation

Requires

Modified UNet implementation with BrushNet control injection points

Mask and masked image features preprocessed at multiple resolutions

PyTorch with support for custom attention module modifications

Limitations

Multi-scale injection increases inference latency by ~15-25% compared to single-scale approaches due to additional feature processing

Requires careful tuning of control weights at each resolution level — improper weighting can cause artifacts or loss of detail

Control injection modifies attention patterns, which may affect generation diversity or introduce mode collapse at certain guidance scales

What makes it unique

Implements dense per-pixel control through multi-resolution feature injection at 4 UNet scales simultaneously, using decomposed masked image features rather than simple concatenation, enabling structural guidance without sacrificing fine detail quality or affecting unmasked regions.

vs alternatives

Provides finer spatial control than single-scale guidance (e.g., ControlNet) while maintaining compatibility with pre-trained models; multi-scale approach ensures both coarse structure and fine details are properly guided, unlike naive mask-based approaches that only work at one resolution.

segmentation and random mask variant support

Medium confidence

Provides separate model variants optimized for two distinct mask types: segmentation masks (clean, object-shaped boundaries) and random masks (arbitrary, potentially irregular shapes). Each variant is trained with different mask distributions and augmentation strategies to handle the specific characteristics of its target mask type. The system automatically selects the appropriate variant based on mask properties or allows explicit selection, enabling optimal inpainting quality for different use cases without requiring users to understand the underlying mask type differences.

Solves for

I want to inpaint objects defined by clean segmentation masks (e.g., from semantic segmentation models)I need to handle arbitrary, user-drawn masks with irregular boundariesI want the model to automatically adapt to different mask types without manual configurationI need to support both precise object replacement and freeform content generation

Best for

Computer vision applications using semantic segmentation for object removal/replacement

Interactive image editing tools where users draw arbitrary masks

Production systems requiring robust handling of diverse mask types

Requires

Separate pre-trained model weights for segmentation and random mask variants

Mask preprocessing to ensure proper format (binary or soft mask, normalized to [0, 1])

Limitations

Using wrong variant for mask type (e.g., segmentation variant on random mask) degrades quality by ~10-15% LPIPS

Segmentation variant assumes clean boundaries; noisy or anti-aliased mask edges may produce artifacts

Random mask variant may over-smooth boundaries on clean segmentation masks, losing precision

What makes it unique

Provides separate trained variants for segmentation vs random masks rather than single unified model, with each variant optimized for its mask type's specific characteristics through targeted training data augmentation and loss weighting strategies.

vs alternatives

Achieves better quality than single-model approaches by training separately for each mask type's distribution; segmentation variant produces cleaner object boundaries while random variant handles freeform masks without over-smoothing, unlike generic inpainting models.

training pipeline with dataset preparation and augmentation

Medium confidence

Provides end-to-end training infrastructure for fine-tuning BrushNet on custom datasets, including dataset loading, mask generation/augmentation, and training loop management. The training system supports both SD 1.5 and SDXL base models with separate training scripts, implements mask augmentation strategies (random mask generation, boundary noise, dilation/erosion), and uses mixed-precision training with gradient accumulation for memory efficiency. Training can be performed on standard datasets (Places, CelebA-HQ) or custom image collections, with support for distributed training across multiple GPUs.

Solves for

I want to fine-tune BrushNet on domain-specific images (e.g., medical images, product photos)I need to train separate models for segmentation vs random masks with appropriate augmentationI want to optimize training for limited GPU memory using gradient accumulation and mixed precisionI need to evaluate training progress with standard metrics (LPIPS, FID, SSIM)

Best for

ML engineers building domain-specific inpainting models

Research teams experimenting with BrushNet variants

Organizations with custom datasets requiring specialized inpainting models

Requires

Python 3.9+

PyTorch 1.12.1+ with CUDA support

GPU with 24GB+ VRAM (A100 or equivalent recommended)

Limitations

Training requires 24-48 hours on single A100 GPU for convergence; multi-GPU training setup is non-trivial

Requires large-scale image datasets (100k+ images recommended) for stable training; smaller datasets may overfit

Mask augmentation strategies must be carefully tuned for target use case; poor augmentation leads to poor generalization

What makes it unique

Implements mask-type-specific training pipelines with separate augmentation strategies for segmentation vs random masks, using mixed-precision training and gradient accumulation to fit on consumer GPUs while maintaining convergence quality comparable to full-precision training.

vs alternatives

Provides complete training infrastructure including dataset preparation and augmentation, unlike inference-only implementations; supports both SD 1.5 and SDXL with separate optimized training scripts, enabling domain-specific model adaptation without external training frameworks.

evaluation metrics computation (lpips, fid, ssim)

Medium confidence

Computes standard image quality metrics for evaluating inpainting results: LPIPS (learned perceptual image patch similarity) for perceptual quality, FID (Fréchet Inception Distance) for distribution matching, and SSIM (structural similarity) for pixel-level fidelity. The evaluation system loads pre-trained feature extractors (InceptionV3 for FID, AlexNet for LPIPS) and compares generated inpainted images against ground truth or reference images. Results are aggregated across test sets and reported with statistical summaries (mean, std, percentiles).

Solves for

I want to quantitatively evaluate inpainting quality using standard computer vision metricsI need to compare different BrushNet variants or base models objectivelyI want to track model performance improvements during training or fine-tuningI need to benchmark against other inpainting methods using comparable metrics

Best for

Researchers publishing inpainting papers with quantitative results

ML engineers comparing model variants during development

Teams establishing quality baselines for production models

Requires

PyTorch with torchvision (for pre-trained feature extractors)

Pre-trained InceptionV3 model (auto-downloaded from torchvision)

Test dataset with ground truth images and masks

Limitations

LPIPS and FID require pre-trained feature extractors (InceptionV3, AlexNet) which add ~2-5 seconds per image evaluation overhead

Metrics are sensitive to image preprocessing (normalization, resizing) — must match training/evaluation setup exactly

LPIPS and FID are not perfectly correlated with human perception; high metric scores don't guarantee visual quality

What makes it unique

Integrates three complementary metrics (perceptual LPIPS, distribution FID, and structural SSIM) with pre-trained feature extractors, providing both aggregate statistics and per-image scores for detailed analysis of inpainting quality across different aspects.

vs alternatives

Provides comprehensive evaluation using multiple metrics rather than single-metric approaches; includes both perceptual (LPIPS) and distribution-level (FID) metrics, enabling nuanced quality assessment compared to pixel-only metrics like SSIM.

gradio web interface for interactive inpainting

Medium confidence

Provides a browser-based interactive interface for real-time inpainting using Gradio, enabling users to upload images, draw masks, enter text prompts, and adjust inference parameters (guidance scale, steps) without coding. The interface handles image upload, mask drawing with canvas tools, prompt input, and displays results with latency information. The Gradio app wraps the inference pipeline and can be deployed locally or on cloud platforms (HuggingFace Spaces, Gradio Cloud) for easy sharing and collaboration.

Solves for

I want to test BrushNet inpainting without writing codeI need to share inpainting capabilities with non-technical usersI want to quickly iterate on prompts and masks to find optimal resultsI need to deploy inpainting as a web service for team collaboration

Best for

Non-technical users exploring inpainting capabilities

Product teams prototyping image editing features

Researchers sharing models with collaborators

Requires

Python 3.9+

Gradio library (pip install gradio)

Pre-trained BrushNet model weights

Limitations

Gradio interface adds ~500ms-1s overhead per request due to HTTP serialization and image encoding/decoding

Mask drawing tools are basic (brush, eraser) — complex masks are tedious to create; better suited for simple object removal

Single-user inference; concurrent requests queue sequentially on single GPU

What makes it unique

Provides lightweight Gradio-based web interface with integrated mask drawing canvas, parameter controls, and real-time inference feedback, enabling non-technical users to interact with BrushNet without API knowledge or local setup.

vs alternatives

Simpler to deploy than custom web frameworks (Flask, FastAPI) while maintaining full inference control; Gradio's automatic API generation enables easy integration with other tools, and built-in sharing features (HuggingFace Spaces) require no infrastructure setup.

instruction-guided editing with text-based spatial control

Medium confidence

Extends basic text-guided inpainting with instruction-based editing that interprets natural language instructions to automatically generate masks and guide inpainting. The system parses instructions like 'remove the person on the left' or 'replace the sky with clouds' to identify regions of interest and apply appropriate inpainting. This capability combines text understanding with spatial reasoning, potentially using auxiliary models (object detection, segmentation) to convert instructions into masks before applying BrushNet inpainting.

Solves for

I want to edit images using natural language instructions without manually drawing masksI need to automatically identify and inpaint specific objects mentioned in textI want to support complex editing tasks like 'replace X with Y' using semantic understandingI need to reduce user effort by automating mask generation from text descriptions

Best for

End-user image editing applications with natural language interfaces

Teams building AI-powered photo editing tools

Accessibility-focused applications where drawing masks is difficult

Requires

Pre-trained object detection or segmentation model (e.g., YOLO, SAM, Mask R-CNN)

Natural language processing for instruction parsing (rule-based or LLM-based)

BrushNet inpainting pipeline for final generation

Limitations

Instruction parsing requires additional models (object detection, segmentation) which add latency (~1-2 seconds per instruction)

Spatial understanding is limited to objects detectable by auxiliary models; abstract concepts ('make it more vibrant') cannot be spatially grounded

Instruction ambiguity can lead to incorrect mask generation; 'remove the person' may fail if multiple people are present

What makes it unique

Combines text-guided inpainting with instruction parsing and spatial reasoning to enable high-level editing commands without manual mask drawing, using auxiliary models for object detection/segmentation to convert natural language into spatial masks.

vs alternatives

More user-friendly than manual mask drawing while maintaining precise control through text instructions; leverages BrushNet's text-guided capabilities with automated mask generation, unlike simple inpainting tools that require manual mask creation.

batch processing with multi-image inpainting

Medium confidence

Enables efficient processing of multiple images with different masks and prompts in a single batch, optimizing GPU utilization and reducing per-image overhead. The batch processor handles variable image sizes through padding/resizing, manages memory efficiently with dynamic batching, and provides progress tracking and error handling for robust production use. Results are returned with metadata (processing time, success/failure status) for each image.

Solves for

I want to inpaint hundreds of images efficiently without sequential processingI need to optimize GPU utilization for production inpainting workloadsI want to process images with different sizes and aspect ratios in a single batchI need robust error handling and progress tracking for long-running batch jobs

Best for

Production systems processing large image collections

Data preprocessing pipelines requiring bulk inpainting

Teams building batch image editing services

Requires

PyTorch with CUDA for efficient batched tensor operations

GPU with sufficient VRAM for batch size (typically 8GB+ for batch_size=4 at 512x512)

Image dataset accessible as list or iterable

Limitations

Variable image sizes require padding to common dimensions, wasting GPU memory on smaller images

Batch size is limited by GPU VRAM; larger batches require larger GPUs or smaller images

Error in one image can fail entire batch unless error handling is implemented; requires careful exception management

What makes it unique

Implements dynamic batching with variable image size handling through padding/resizing, providing efficient GPU utilization for multi-image workloads while maintaining per-image metadata and error tracking for production robustness.

vs alternatives

More efficient than sequential single-image processing by batching multiple images on GPU; handles variable sizes automatically unlike naive batching approaches, and includes comprehensive error handling and progress tracking for production use.

Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.

Related Artifactssharing capabilities

Artifacts that share capabilities with BrushNet, ranked by overlap. Discovered automatically through the match graph.

Repository53

IOPaint

Image inpainting tool powered by SOTA AI Model. Remove any unwanted object, defect, people from your pictures or erase and replace(powered by stable diffusion) any thing on your pictures.

stable diffusion-based object replacement and outpaintingtraditional inpainting with lama, mat, and zits models

2 shared capabilities

Dataset23

On Distillation of Guided Diffusion Models

* ⭐ 10/2022: [LAION-5B: An open large-scale dataset for training next generation image-text models (LAION-5B)](https://arxiv.org/abs/2210.08402)

high-quality inpainting with reduced computational cost

1 shared capability

Model20

Qwen-Image-Edit-2511-LoRAs-Fast

Qwen-Image-Edit-2511-LoRAs-Fast — AI demo on HuggingFace

mask-guided diffusion-based image inpainting

1 shared capability

Model25

Imagen

Imagen by Google is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language...

image inpainting and selective region editing

1 shared capability

Repository48

diffusionbee-stable-diffusion-ui

Diffusion Bee is the easiest way to run Stable Diffusion locally on your M1 Mac. Comes with a one-click installer. No dependencies or technical knowledge needed.

inpainting-selective-image-region-replacement

1 shared capability

Product20

Denoising Diffusion Probabilistic Models (DDPM)

* 🏆 2020: [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)](https://arxiv.org/abs/2010.11929)

image-inpainting-via-conditional-diffusion

1 shared capability

Best For

✓Computer vision researchers implementing plug-and-play diffusion extensions
✓ML engineers building image editing applications on top of Stable Diffusion
✓Teams requiring production-grade inpainting without full model retraining
✓Application developers building image editing UIs or APIs
✓Data scientists prototyping inpainting workflows
✓Teams deploying inpainting as a microservice
✓ML engineers optimizing models for production deployment
✓Teams building edge AI applications with resource constraints

Known Limitations

⚠Requires pre-trained base diffusion model (SD 1.5 or SDXL) — cannot function standalone
⚠Inference latency depends on base model's diffusion steps (typically 50-100 steps for quality results)
⚠Memory footprint scales with image resolution; 4K+ images may require gradient checkpointing or reduced batch sizes
⚠Mask quality directly impacts output quality — poorly defined masks produce artifacts at boundaries
⚠Pipeline initialization loads all model components into memory (~7GB for SD 1.5, ~13GB for SDXL) — requires GPU with sufficient VRAM
⚠Sequential processing of batches; no built-in distributed inference across multiple GPUs

Requirements

Python 3.9+PyTorch 1.12.1+CUDA 11.6+ (recommended for GPU inference)Hugging Face diffusers library with BrushNet custom modificationsPre-trained Stable Diffusion 1.5 or SDXL model weightsHugging Face transformers library (for CLIP text encoder)Pre-trained model weights accessible via HuggingFace Hub or local pathGPU with 8GB+ VRAM (16GB+ recommended for SDXL)

Input / Output

Accepts: PIL Image or numpy array (RGB, 512x512 or 768x768 typical), Binary mask (single-channel, same spatial dimensions as image), Text prompt (string, 1-77 tokens for SD 1.5, up to 256 for SDXL), Optional: negative prompt, guidance scale, number of inference steps, image: PIL Image or numpy array (RGB, uint8), mask: PIL Image or numpy array (grayscale, uint8 or binary), prompt: string (text description of desired inpainted content), negative_prompt: optional string, guidance_scale: float (typically 7.5-15.0), num_inference_steps: int (typically 20-100), generator: optional torch.Generator for reproducibility, model: torch.nn.Module (BrushNet or full pipeline), quantization_type: str ('int8', 'fp16', 'dynamic'), calibration_data: list of sample inputs (for post-training quantization), target_device: str ('cpu', 'gpu', 'edge'), model_id: str (HuggingFace model identifier or local path), scheduler: diffusers.SchedulerMixin (e.g., DDIMScheduler, EulerDiscreteScheduler), safety_checker: optional diffusers safety checker, feature_extractor: optional diffusers feature extractor, image: PIL Image or torch.Tensor (RGB, normalized to [-1, 1]), mask: PIL Image or torch.Tensor (grayscale, normalized to [0, 1]), generator: optional torch.Generator for reproducible noise sampling, unet: diffusers.UNet2DConditionModel (modified with BrushNet control), masked_image_latents: torch.Tensor (multi-scale features), mask: torch.Tensor (binary or soft mask), timestep: int (current diffusion step), encoder_hidden_states: torch.Tensor (text embeddings from CLIP), image: PIL Image or torch.Tensor, mask: PIL Image or torch.Tensor (binary or soft mask), mask_type: str ('segmentation' or 'random', optional for auto-detection), prompt: string, dataset_path: str (directory containing images or HuggingFace dataset identifier), base_model: str ('sd1.5' or 'sdxl'), mask_type: str ('segmentation' or 'random'), batch_size: int (typically 4-16 depending on GPU memory), learning_rate: float (typically 1e-4 to 1e-5), num_epochs: int (typically 10-50), generated_images: list of PIL Images or torch.Tensors (inpainted results), ground_truth_images: list of PIL Images or torch.Tensors (reference images), masks: list of PIL Images or torch.Tensors (optional, for masked metrics), batch_size: int (for efficient GPU processing), image: uploaded image file (PNG, JPG, WebP), mask: drawn on canvas or uploaded as image, prompt: text input field, negative_prompt: optional text input, guidance_scale: slider (typically 1-20), num_steps: slider (typically 20-100), instruction: string (natural language editing instruction), optional: reference_image (for style transfer or content guidance), images: list of PIL Images or torch.Tensors, masks: list of PIL Images or torch.Tensors (same length as images), prompts: list of strings (same length as images), batch_size: int (typically 2-8 depending on GPU memory), num_workers: int (for parallel data loading)

Produces: PIL Image (inpainted result, same dimensions as input), Latent tensor (if returning intermediate diffusion state), PIL Image (inpainted result), Optional: list of intermediate latents if return_dict=True, quantized_model: torch.nn.Module (optimized weights), quality_metrics: dict (LPIPS, FID on validation set before/after quantization), performance_metrics: dict (latency, throughput, model_size), pipeline: StableDiffusionBrushNetPipeline (fully initialized and ready for inference), latents: torch.Tensor (shape: [batch, 4, height//8, width//8]), masked_latents: torch.Tensor (latents with mask applied), mask_latent: torch.Tensor (downsampled mask in latent space), noise_pred: torch.Tensor (predicted noise with BrushNet control applied), intermediate_features: dict (optional, for debugging or analysis), metadata: dict (mask type used, confidence if auto-detected), model_checkpoint: torch.nn.Module (trained BrushNet weights), training_logs: dict (loss curves, metric values), evaluation_metrics: dict (LPIPS, FID, SSIM on validation set), metrics: dict with keys 'lpips', 'fid', 'ssim', each containing mean and std, per_image_metrics: list of dicts (individual image scores for detailed analysis), inpainted_image: PIL Image displayed in browser, inference_time: float (seconds, displayed to user), edited_image: PIL Image (result of instruction-guided inpainting), mask_used: PIL Image (generated mask for transparency/debugging), instruction_confidence: float (confidence in instruction interpretation), results: list of dicts, each containing {'image': PIL Image, 'time': float, 'success': bool, 'error': str or None}, summary: dict with aggregate statistics (total_time, success_rate, avg_time_per_image)

UnfragileRank

Adoption47%(35% weight)

Quality32%(20% weight)

Ecosystem58%(25% weight)

Match Graph10%(15% weight)

Freshness75%(5% weight)

UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.

Type: Repository

12 capabilities

Visit BrushNet→

Repository Details

1,726

Stars

144

Forks

Python

Language

NOASSERTION

License

Topics

diffusiondiffusion-modelseccveccv2024image-inpaintingtext-to-image

Last commit: Dec 17, 2024

About

[ECCV 2024] The official implementation of paper "BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion"

Alternatives to BrushNet

Dreambooth-Stable-Diffusion45Repository

Implementation of Dreambooth (https://arxiv.org/abs/2208.12242) with Stable Diffusion

Compare →

sdnext51Repository

SD.Next: All-in-one WebUI for AI generative image and video creation, captioning and processing

Compare →

fast-stable-diffusion48Repository

fast-stable-diffusion + DreamBooth

Compare →

ai-notes37Prompt

notes for software engineers getting up to speed on new AI developments. Serves as datastore for https://latent.space writing, and product brainstorming, but has cleaned up canonical references under the /Resources folder.

Compare →

Are you the builder of BrushNet?

Claim this artifact to get a verified badge, access match analytics, see which intents users search for, and manage your listing.

Claim this artifact →Verification via email

Get the weekly brief

New tools, rising stars, and what's actually worth your time. No spam.

Data Sources

github

Looking for something else?

Search →

Capabilities12 decomposed

decomposed dual-branch diffusion inpainting with masked feature separation

Medium confidence

Solves for

Best for

Computer vision researchers implementing plug-and-play diffusion extensions

ML engineers building image editing applications on top of Stable Diffusion

Teams requiring production-grade inpainting without full model retraining

Requires

Python 3.9+

PyTorch 1.12.1+

CUDA 11.6+ (recommended for GPU inference)

Limitations

Requires pre-trained base diffusion model (SD 1.5 or SDXL) — cannot function standalone

Inference latency depends on base model's diffusion steps (typically 50-100 steps for quality results)

Memory footprint scales with image resolution; 4K+ images may require gradient checkpointing or reduced batch sizes

What makes it unique

vs alternatives

text-guided inpainting pipeline with multi-variant model support

Medium confidence

Solves for

Best for

Application developers building image editing UIs or APIs

Data scientists prototyping inpainting workflows

Teams deploying inpainting as a microservice

Requires

Python 3.9+

PyTorch 1.12.1+

Hugging Face transformers library (for CLIP text encoder)

Limitations

Pipeline initialization loads all model components into memory (~7GB for SD 1.5, ~13GB for SDXL) — requires GPU with sufficient VRAM

Sequential processing of batches; no built-in distributed inference across multiple GPUs

Text encoding is fixed to CLIP tokenizer (77 tokens for SD 1.5); longer prompts are truncated

What makes it unique

vs alternatives

Simpler API than raw diffusers pipeline usage while maintaining full control over inference parameters; supports both SD 1.5 and SDXL without code changes, unlike single-model implementations.

model weight quantization and optimization for deployment

Medium confidence

Solves for

Best for

ML engineers optimizing models for production deployment

Teams building edge AI applications with resource constraints

Organizations requiring high-throughput inference on limited hardware

Requires

PyTorch with quantization support (torch.quantization)

Optional: TensorRT or ONNX for advanced optimization

Calibration dataset for post-training quantization (typically 100-500 images)

Limitations

INT8 quantization can cause 2-5% quality degradation (LPIPS) compared to FP32; requires validation on target domain

Quantization is model-specific; quantized weights cannot be transferred between different base models (SD 1.5 vs SDXL)

FP16 mixed precision may cause numerical instability in some operations; requires careful testing

What makes it unique

vs alternatives

integration with huggingface diffusers ecosystem

Medium confidence

Solves for

Best for

Developers already using HuggingFace diffusers in their projects

Teams building modular diffusion pipelines with multiple extensions

Researchers combining BrushNet with other diffusion techniques

Requires

HuggingFace diffusers library (0.21.0+)

PyTorch 1.12.1+

Transformers library for CLIP text encoder

Limitations

Integration is limited to diffusers API conventions; custom diffusion implementations may not be compatible

Some diffusers features (e.g., IP-Adapter, certain LoRA variants) may not work seamlessly with BrushNet without additional integration work

Scheduler compatibility depends on diffusers version; older versions may lack support for newer schedulers

What makes it unique

vs alternatives

mask-aware latent encoding and feature extraction

Medium confidence

Solves for

Best for

ML engineers optimizing inference performance by working in latent space

Researchers studying diffusion model behavior in latent representations

Teams building custom inpainting pipelines with specialized preprocessing needs

Requires

Pre-trained VAE encoder (included in SD 1.5 or SDXL model)

PyTorch with CUDA support for efficient tensor operations

Input image dimensions must be multiples of 8 (e.g., 512x512, 768x768)

Limitations

VAE encoding introduces quantization artifacts due to 8x spatial downsampling — fine details in masks may be lost

Mask must be resized to match latent dimensions (typically 64x64 for 512x512 images); interpolation can introduce boundary artifacts

Latent space representation is model-specific (SD 1.5 VAE differs from SDXL VAE) — cannot transfer latents between models

What makes it unique

vs alternatives

multi-resolution dense per-pixel control injection

Medium confidence

Solves for

Best for

Researchers studying multi-scale diffusion control mechanisms

ML engineers implementing custom diffusion guidance strategies

Teams requiring fine-grained control over inpainting quality and boundary preservation

Requires

Modified UNet implementation with BrushNet control injection points

Mask and masked image features preprocessed at multiple resolutions

PyTorch with support for custom attention module modifications

Limitations

Multi-scale injection increases inference latency by ~15-25% compared to single-scale approaches due to additional feature processing

Requires careful tuning of control weights at each resolution level — improper weighting can cause artifacts or loss of detail

Control injection modifies attention patterns, which may affect generation diversity or introduce mode collapse at certain guidance scales

What makes it unique

vs alternatives

segmentation and random mask variant support

Medium confidence

Solves for

Best for

Computer vision applications using semantic segmentation for object removal/replacement

Interactive image editing tools where users draw arbitrary masks

Production systems requiring robust handling of diverse mask types

Requires

Separate pre-trained model weights for segmentation and random mask variants

Mask preprocessing to ensure proper format (binary or soft mask, normalized to [0, 1])

Limitations

Using wrong variant for mask type (e.g., segmentation variant on random mask) degrades quality by ~10-15% LPIPS

Segmentation variant assumes clean boundaries; noisy or anti-aliased mask edges may produce artifacts

Random mask variant may over-smooth boundaries on clean segmentation masks, losing precision

What makes it unique

vs alternatives

training pipeline with dataset preparation and augmentation

Medium confidence

Solves for

Best for

ML engineers building domain-specific inpainting models

Research teams experimenting with BrushNet variants

Organizations with custom datasets requiring specialized inpainting models

Requires

Python 3.9+

PyTorch 1.12.1+ with CUDA support

GPU with 24GB+ VRAM (A100 or equivalent recommended)

Limitations

Training requires 24-48 hours on single A100 GPU for convergence; multi-GPU training setup is non-trivial

Requires large-scale image datasets (100k+ images recommended) for stable training; smaller datasets may overfit

Mask augmentation strategies must be carefully tuned for target use case; poor augmentation leads to poor generalization

What makes it unique

vs alternatives

evaluation metrics computation (lpips, fid, ssim)

Medium confidence

Solves for

Best for

Researchers publishing inpainting papers with quantitative results

ML engineers comparing model variants during development

Teams establishing quality baselines for production models

Requires

PyTorch with torchvision (for pre-trained feature extractors)

Pre-trained InceptionV3 model (auto-downloaded from torchvision)

Test dataset with ground truth images and masks

Limitations

LPIPS and FID require pre-trained feature extractors (InceptionV3, AlexNet) which add ~2-5 seconds per image evaluation overhead

Metrics are sensitive to image preprocessing (normalization, resizing) — must match training/evaluation setup exactly

LPIPS and FID are not perfectly correlated with human perception; high metric scores don't guarantee visual quality

What makes it unique

vs alternatives

gradio web interface for interactive inpainting

Medium confidence

Solves for

Best for

Non-technical users exploring inpainting capabilities

Product teams prototyping image editing features

Researchers sharing models with collaborators

Requires

Python 3.9+

Gradio library (pip install gradio)

Pre-trained BrushNet model weights

Limitations

Gradio interface adds ~500ms-1s overhead per request due to HTTP serialization and image encoding/decoding

Mask drawing tools are basic (brush, eraser) — complex masks are tedious to create; better suited for simple object removal

Single-user inference; concurrent requests queue sequentially on single GPU

What makes it unique

vs alternatives

instruction-guided editing with text-based spatial control

Medium confidence

Solves for

Best for

End-user image editing applications with natural language interfaces

Teams building AI-powered photo editing tools

Accessibility-focused applications where drawing masks is difficult

Requires

Pre-trained object detection or segmentation model (e.g., YOLO, SAM, Mask R-CNN)

Natural language processing for instruction parsing (rule-based or LLM-based)

BrushNet inpainting pipeline for final generation

Limitations

Instruction parsing requires additional models (object detection, segmentation) which add latency (~1-2 seconds per instruction)

Spatial understanding is limited to objects detectable by auxiliary models; abstract concepts ('make it more vibrant') cannot be spatially grounded

Instruction ambiguity can lead to incorrect mask generation; 'remove the person' may fail if multiple people are present

What makes it unique

vs alternatives

batch processing with multi-image inpainting

Medium confidence

Solves for

Best for

Production systems processing large image collections

Data preprocessing pipelines requiring bulk inpainting

Teams building batch image editing services

Requires

PyTorch with CUDA for efficient batched tensor operations

GPU with sufficient VRAM for batch size (typically 8GB+ for batch_size=4 at 512x512)

Image dataset accessible as list or iterable

Limitations

Variable image sizes require padding to common dimensions, wasting GPU memory on smaller images

Batch size is limited by GPU VRAM; larger batches require larger GPUs or smaller images

Error in one image can fail entire batch unless error handling is implemented; requires careful exception management

What makes it unique

vs alternatives

Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.

Alternatives to BrushNet

Dreambooth-Stable-Diffusion45Repository

Implementation of Dreambooth (https://arxiv.org/abs/2208.12242) with Stable Diffusion

Compare →

sdnext51Repository

SD.Next: All-in-one WebUI for AI generative image and video creation, captioning and processing

Compare →

fast-stable-diffusion48Repository

fast-stable-diffusion + DreamBooth

Compare →

ai-notes37Prompt

Compare →

BrushNet

Capabilities12 decomposed

decomposed dual-branch diffusion inpainting with masked feature separation

text-guided inpainting pipeline with multi-variant model support

model weight quantization and optimization for deployment

integration with huggingface diffusers ecosystem

mask-aware latent encoding and feature extraction

multi-resolution dense per-pixel control injection

segmentation and random mask variant support

training pipeline with dataset preparation and augmentation

evaluation metrics computation (lpips, fid, ssim)

gradio web interface for interactive inpainting

instruction-guided editing with text-based spatial control

batch processing with multi-image inpainting

Related Artifactssharing capabilities

IOPaint

On Distillation of Guided Diffusion Models

Qwen-Image-Edit-2511-LoRAs-Fast

Imagen

diffusionbee-stable-diffusion-ui

Denoising Diffusion Probabilistic Models (DDPM)

Best For

Known Limitations

Requirements

Input / Output

UnfragileRank

Repository Details

About

Categories

Alternatives to BrushNet

Are you the builder of BrushNet?

Get the weekly brief

Data Sources

BrushNet

Capabilities12 decomposed

decomposed dual-branch diffusion inpainting with masked feature separation

text-guided inpainting pipeline with multi-variant model support

model weight quantization and optimization for deployment

integration with huggingface diffusers ecosystem

mask-aware latent encoding and feature extraction

multi-resolution dense per-pixel control injection

segmentation and random mask variant support

training pipeline with dataset preparation and augmentation

evaluation metrics computation (lpips, fid, ssim)

gradio web interface for interactive inpainting

instruction-guided editing with text-based spatial control

batch processing with multi-image inpainting

Related Artifactssharing capabilities

IOPaint

On Distillation of Guided Diffusion Models

Qwen-Image-Edit-2511-LoRAs-Fast

Imagen

diffusionbee-stable-diffusion-ui

Denoising Diffusion Probabilistic Models (DDPM)

Best For

Known Limitations

Requirements

Input / Output

UnfragileRank

Repository Details

About

Categories

Alternatives to BrushNet

Are you the builder of BrushNet?

Get the weekly brief

Data Sources