Capability
Multimodal Prompt Fusion For Text Sketch Coherence
20 artifacts provide this capability.
Top Matches
via “cross-attention fusion of image features and prompt embeddings”
Meta's foundation model for visual segmentation.
Unique: Uses bidirectional cross-attention, in which prompts attend to image features and image features attend to prompts, so each refines the other. Prompts disambiguate image regions, and image context refines how each prompt is interpreted.
vs others: More principled than concatenation-based fusion. Attention learns which image regions are relevant to each prompt, which avoids diluting features with irrelevant regions and allows explicit composition of multiple prompts.
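The bidirectional cross-attention described above can be illustrated with a minimal sketch. It is not the model's actual implementation: it assumes single-head scaled dot-product attention with no learned query/key/value projections, and the names `cross_attend` and `bidirectional_fusion`, the token counts, and the embedding size are all hypothetical choices made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Update each query with an attention-weighted mix of keys_values.

    Assumption: keys and values are the same vectors and there are no
    learned projections; a real model would apply learned W_q, W_k, W_v.
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    updated, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys_values]
        w = softmax(scores)
        mixed = [
            sum(w[j] * keys_values[j][d] for j in range(len(keys_values)))
            for d in range(dim)
        ]
        # Residual update: keep the original query and add the attended context.
        updated.append([qi + mi for qi, mi in zip(q, mixed)])
        weights.append(w)
    return updated, weights


def bidirectional_fusion(prompt_tokens, image_tokens):
    """One round of two-way fusion (hypothetical helper)."""
    # Direction 1: prompts attend to image features.
    prompt_tokens, prompt_to_image = cross_attend(prompt_tokens, image_tokens)
    # Direction 2: image features attend to the refined prompts.
    image_tokens, image_to_prompt = cross_attend(image_tokens, prompt_tokens)
    return prompt_tokens, image_tokens, prompt_to_image, image_to_prompt


random.seed(0)
DIM = 8  # assumed embedding size
prompts = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(2)]
image = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(16)]

fused_prompts, fused_image, p2i, i2p = bidirectional_fusion(prompts, image)
```

Each row of `p2i` is a distribution over image positions for one prompt, which is how attention picks out the regions relevant to that prompt; concatenation-based fusion has no per-prompt weighting of this kind.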