{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"hf-model-facebook--mask2former-swin-large-ade-semantic","slug":"facebook--mask2former-swin-large-ade-semantic","name":"mask2former-swin-large-ade-semantic","type":"model","url":"https://huggingface.co/facebook/mask2former-swin-large-ade-semantic","page_url":"https://unfragile.ai/facebook--mask2former-swin-large-ade-semantic","categories":["image-generation"],"tags":["transformers","pytorch","safetensors","mask2former","vision","image-segmentation","dataset:coco","arxiv:2112.01527","arxiv:2107.06278","license:other","endpoints_compatible","deploy:azure","region:us"],"pricing":{"model":"open_source","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"hf-model-facebook--mask2former-swin-large-ade-semantic__cap_0","uri":"capability://image.visual.panoptic.aware.semantic.segmentation.with.mask.classification","name":"panoptic-aware semantic segmentation with mask classification","description":"Performs dense pixel-level semantic segmentation using a Mask2Former architecture that combines masked attention mechanisms with a Swin Transformer backbone. The model processes images through a multi-scale feature pyramid, applies mask-based queries to isolate semantic regions, and classifies each mask against 150 ADE20K semantic classes. 
Unlike traditional FCN-based segmentation, it uses learnable mask tokens that attend only to relevant spatial regions, reducing computational overhead while improving boundary precision.","intents":["segment indoor and outdoor scenes into semantic categories for scene understanding applications","extract precise object and stuff boundaries for robotics and autonomous systems","generate dense semantic annotations for training downstream vision models","analyze complex multi-class scenes with fine-grained category distinctions"],"best_for":["computer vision researchers building scene understanding pipelines","robotics teams needing real-time environment parsing","teams fine-tuning models on domain-specific segmentation tasks","developers building indoor navigation or spatial analysis systems"],"limitations":["Trained exclusively on ADE20K indoor/outdoor scenes — performance degrades on out-of-distribution domains (medical imaging, satellite imagery, industrial inspection)","Inference latency ~500-800ms on GPU for 1024x1024 images; CPU inference impractical for real-time applications","Memory footprint ~1.3GB for model weights; requires GPU with 8GB+ VRAM for batch processing","Fixed 150-class output space; requires retraining or adapter layers for custom semantic categories","Struggles with very small objects (<2% image area) and thin structures due to mask-based attention design"],"requires":["PyTorch 1.9+","transformers library 4.25+","CUDA 11.0+ or compatible GPU (RTX 3060 minimum recommended)","Python 3.8+","detectron2 library for inference utilities"],"input_types":["RGB images (3-channel, arbitrary resolution)","image tensors normalized to ImageNet statistics (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])"],"output_types":["semantic segmentation masks (HxW integer tensor with class indices 0-149)","per-pixel class probabilities (HxWx150 float tensor)","instance masks for panoptic 
interpretation"],"categories":["image-visual","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-model-facebook--mask2former-swin-large-ade-semantic__cap_1","uri":"capability://image.visual.multi.scale.hierarchical.feature.extraction.with.swin.transformer.backbone","name":"multi-scale hierarchical feature extraction with swin transformer backbone","description":"Extracts image features through a Swin Transformer encoder that processes images in shifted-window blocks across 4 hierarchical stages, producing multi-scale feature maps at 1/4, 1/8, 1/16, and 1/32 resolution. Each stage applies self-attention within local windows (7x7 default) with periodic shifts to enable cross-window communication, generating features that capture both fine-grained details and semantic context. This hierarchical design enables the subsequent Mask2Former decoder to operate efficiently across scales without explicit dilated convolutions.","intents":["extract multi-resolution feature representations suitable for dense prediction tasks","reduce computational cost of vision transformers through local window attention vs global attention","enable transfer learning by leveraging ImageNet-pretrained Swin weights","support downstream tasks requiring both local detail and global semantic context"],"best_for":["teams building custom segmentation models that need pretrained feature extractors","researchers comparing transformer vs CNN backbones for dense prediction","production systems requiring efficient feature extraction without full model retraining"],"limitations":["Window-attention design creates artificial boundaries at window edges; requires shifted windows to mitigate but adds complexity","Swin-Large has 196M parameters; fine-tuning requires careful learning rate scheduling and gradient accumulation","Feature maps at 1/32 resolution lose fine-grained spatial information; requires upsampling for pixel-accurate predictions","Positional embeddings are 
absolute and resolution-dependent; direct transfer to different input resolutions requires interpolation"],"requires":["PyTorch 1.9+","timm library 0.6.0+ for Swin implementation","GPU with 16GB+ VRAM for fine-tuning"],"input_types":["RGB images (3-channel, typically 512x512 or 1024x1024)"],"output_types":["feature pyramids at 4 scales (C4, C8, C16, C32 stride)","feature tensors with 192/384/768/1536 channels per stage"],"categories":["image-visual","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-model-facebook--mask2former-swin-large-ade-semantic__cap_2","uri":"capability://image.visual.mask.based.query.decoding.with.cross.attention.refinement","name":"mask-based query decoding with cross-attention refinement","description":"Decodes multi-scale features into semantic masks through a Mask2Former decoder that maintains a set of learnable mask queries (typically 100-200 queries per image). Each query attends to image features via cross-attention, generating a binary mask prediction and semantic class logit. The decoder iteratively refines masks across 9 transformer layers, with each layer updating both mask embeddings and spatial attention weights. 
Masks are upsampled to full resolution and post-processed via CRF or morphological operations to enforce spatial consistency.","intents":["convert multi-scale image features into instance-aware semantic masks with class predictions","refine mask boundaries through iterative cross-attention without explicit boundary detection networks","handle variable numbers of objects/regions through query-based decoding rather than fixed-grid predictions","enable end-to-end differentiable segmentation for joint optimization with downstream tasks"],"best_for":["researchers implementing mask-based segmentation architectures","teams requiring interpretable attention maps for model debugging","applications needing instance-level semantic understanding alongside dense predictions"],"limitations":["Query-based decoding requires careful initialization; poor query initialization leads to mode collapse where multiple queries predict identical masks","Computational cost scales with number of queries; 200 queries adds ~150ms latency vs 100 queries","Cross-attention mechanism requires storing attention maps for all query-feature pairs; 8GB+ VRAM needed for batch size >2 at 1024x1024","Mask refinement is iterative; early stopping or layer pruning significantly degrades boundary quality","Post-processing (CRF, morphological ops) adds 50-100ms latency and requires careful hyperparameter tuning per domain"],"requires":["PyTorch 1.9+ with autograd support","detectron2 0.6+ for mask operations","CUDA 11.0+ for efficient cross-attention kernels"],"input_types":["multi-scale feature pyramids from Swin backbone","learnable mask query embeddings (100-200 x 256-dim)"],"output_types":["binary mask predictions (HxW per query)","semantic class logits (150-class per query)","attention weight maps for 
interpretability"],"categories":["image-visual","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-model-facebook--mask2former-swin-large-ade-semantic__cap_3","uri":"capability://data.processing.analysis.ade20k.150.class.semantic.taxonomy.mapping","name":"ade20k 150-class semantic taxonomy mapping","description":"Maps predicted mask queries to a fixed set of 150 semantic classes from the ADE20K dataset, which includes diverse indoor/outdoor scene categories (e.g., wall, floor, ceiling, tree, person, car, sky). The model outputs class logits for each mask query, which are converted to class indices via argmax. The taxonomy includes both 'thing' classes (countable objects like people, cars) and 'stuff' classes (amorphous regions like sky, grass), enabling panoptic-style interpretation where both instance and semantic information are available.","intents":["classify segmented regions into standardized ADE20K semantic categories for scene understanding","enable downstream tasks that require semantic labels (e.g., autonomous navigation, indoor mapping)","support transfer learning by leveraging ADE20K pretraining for domain-specific fine-tuning","provide consistent class indices across different inference runs and batch sizes"],"best_for":["teams building scene understanding systems for indoor/outdoor environments","researchers fine-tuning on domain-specific segmentation with ADE20K as pretraining","applications requiring standardized semantic labels for interoperability"],"limitations":["Fixed 150-class output space; custom categories require retraining or adapter layers (e.g., linear projection + softmax)","Class imbalance in ADE20K (e.g., 'wall' dominates; rare classes like 'escalator' have <0.1% pixels) leads to poor recall on underrepresented categories","Taxonomy is English-language; multilingual applications require external label mapping","Some ADE20K classes are ambiguous or overlapping (e.g., 'building' vs 'house'); model 
predictions may be inconsistent at class boundaries","Out-of-distribution domains (medical imaging, satellite imagery) have no semantic mapping to ADE20K classes"],"requires":["ADE20K class index mapping (provided in model config)","knowledge of ADE20K taxonomy for interpreting predictions"],"input_types":["class logits from Mask2Former decoder (150-dim per query)"],"output_types":["class indices (0-149 per pixel)","class probabilities (softmax over 150 classes)","class names (string labels for visualization)"],"categories":["data-processing-analysis","image-visual"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-model-facebook--mask2former-swin-large-ade-semantic__cap_4","uri":"capability://image.visual.batch.inference.with.dynamic.input.resolution.handling","name":"batch inference with dynamic input resolution handling","description":"Supports inference on variable-resolution images through dynamic padding and resizing strategies that maintain aspect ratio while fitting images into GPU memory. The model accepts images of arbitrary size, internally resizes to a multiple of 32 (e.g., 512x512, 1024x1024), and outputs segmentation masks at the original resolution through bilinear upsampling. 
Batch processing is supported with automatic padding to match the largest image in the batch, enabling efficient GPU utilization for multiple images.","intents":["process images of different resolutions without retraining or model modification","maximize GPU throughput by batching variable-resolution images with automatic padding","maintain output mask resolution matching input images for downstream pixel-level tasks","handle real-world image streams with heterogeneous dimensions"],"best_for":["production systems processing diverse image sources (mobile cameras, webcams, surveillance feeds)","batch processing pipelines requiring efficient GPU utilization","applications requiring output masks at original image resolution"],"limitations":["Dynamic padding adds computational overhead; images with extreme aspect ratios (e.g., 100x10000) waste GPU memory on padding","Bilinear upsampling to original resolution introduces interpolation artifacts; masks may have jagged boundaries if upsampled >2x","Batch processing requires all images to be padded to the largest image size; heterogeneous batches reduce GPU efficiency by 10-30%","No built-in multi-GPU batching; distributed inference requires external orchestration (e.g., Ray)","Memory usage scales linearly with batch size; batch size >4 at 1024x1024 requires 24GB+ VRAM"],"requires":["PyTorch 1.9+ with CUDA support","transformers library with dynamic padding support","GPU with 8GB+ VRAM for single-image inference, 16GB+ for batching"],"input_types":["RGB images (3-channel, arbitrary resolution)","batches of images with variable dimensions"],"output_types":["segmentation masks at original input resolution","class indices and 
probabilities"],"categories":["image-visual","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-model-facebook--mask2former-swin-large-ade-semantic__cap_5","uri":"capability://image.visual.post.processing.with.morphological.refinement.and.crf.smoothing","name":"post-processing with morphological refinement and crf smoothing","description":"Refines raw mask predictions through optional morphological operations (erosion, dilation, opening, closing) and Conditional Random Field (CRF) smoothing that enforces spatial consistency. Morphological operations remove small spurious predictions and fill holes in masks. CRF smoothing models pixel-level dependencies based on color similarity and spatial proximity, iteratively updating mask labels to maximize consistency with image features. This post-processing is applied after upsampling to original resolution and can be toggled based on application requirements.","intents":["remove noise and small artifacts from raw mask predictions","enforce spatial consistency and smooth mask boundaries","improve boundary precision for downstream tasks like instance tracking or 3D reconstruction","trade off inference latency vs output quality through configurable post-processing"],"best_for":["applications requiring clean, smooth segmentation masks (e.g., image editing, 3D reconstruction)","systems where boundary precision is critical (e.g., medical imaging, robotics)","teams willing to trade 50-100ms latency for 2-3% mIoU improvement"],"limitations":["Morphological operations are sensitive to kernel size; over-aggressive erosion removes fine details, under-aggressive dilation leaves noise","CRF smoothing adds 50-150ms latency depending on image resolution and number of iterations (typically 10-20)","CRF requires careful hyperparameter tuning (spatial bandwidth, color bandwidth, compatibility matrix); poor tuning can degrade boundaries","Post-processing is non-differentiable; cannot be included in end-to-end 
training pipelines","Morphological operations assume binary masks; multi-class masks require per-class processing, increasing latency"],"requires":["OpenCV 4.0+ for morphological operations","pydensecrf library for CRF inference","scikit-image for advanced morphological operations"],"input_types":["binary or multi-class segmentation masks (HxW integer tensor)","original RGB image for CRF color features"],"output_types":["refined segmentation masks (HxW integer tensor)","optionally, boundary confidence maps"],"categories":["image-visual","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-model-facebook--mask2former-swin-large-ade-semantic__cap_6","uri":"capability://image.visual.transfer.learning.and.fine.tuning.on.custom.datasets","name":"transfer learning and fine-tuning on custom datasets","description":"Enables fine-tuning the pretrained Mask2Former model on custom segmentation datasets through standard PyTorch training loops. The model's weights are initialized from ADE20K pretraining, and can be adapted to new domains by training on custom labeled data. Fine-tuning typically involves freezing the Swin backbone for initial epochs, then unfreezing for full-model training. 
Custom datasets require annotation in standard formats (COCO JSON, semantic segmentation masks) and can have arbitrary numbers of classes, enabling domain adaptation without retraining from scratch.","intents":["adapt the model to domain-specific segmentation tasks (medical imaging, satellite imagery, industrial inspection)","reduce training time and data requirements by leveraging ADE20K pretraining","fine-tune on custom class taxonomies different from ADE20K's 150 classes","build production models for niche applications with limited labeled data"],"best_for":["teams with domain-specific segmentation tasks and 500-5000 labeled images","researchers comparing transfer learning vs training from scratch","production teams needing to adapt models to new domains without full retraining"],"limitations":["Fine-tuning requires careful hyperparameter selection (learning rate, warmup, weight decay); poor tuning leads to catastrophic forgetting or divergence","Swin-Large backbone has 196M parameters; full fine-tuning requires 16GB+ VRAM and 2-4 days on single GPU for 5000-image datasets","Custom class taxonomies require modifying the classification head (final linear layer); retraining the head alone may underfit if domain is very different from ADE20K","Overfitting is common with <1000 labeled images; requires aggressive data augmentation and regularization","No built-in domain adaptation techniques (e.g., adversarial training, self-supervised learning); requires manual implementation"],"requires":["PyTorch 1.9+","transformers library 4.25+","detectron2 for training utilities","custom dataset in COCO JSON or semantic segmentation mask format","GPU with 16GB+ VRAM for full fine-tuning","500+ labeled images for meaningful fine-tuning"],"input_types":["custom RGB images","semantic segmentation masks or COCO-format annotations"],"output_types":["fine-tuned model weights","segmentation masks for custom 
classes"],"categories":["image-visual","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-model-facebook--mask2former-swin-large-ade-semantic__cap_7","uri":"capability://automation.workflow.model.export.and.deployment.to.edge.devices","name":"model export and deployment to edge devices","description":"Supports exporting the trained model to optimized formats (ONNX, TorchScript, TensorRT) for deployment on edge devices and cloud inference endpoints. The model can be quantized (int8, fp16) to reduce size and latency, enabling deployment on resource-constrained devices (mobile, embedded systems). HuggingFace integration provides one-click deployment to cloud endpoints (AWS SageMaker, Azure ML, Hugging Face Inference API) with automatic batching and scaling.","intents":["deploy segmentation models to edge devices (mobile, embedded systems, IoT) with reduced latency and memory","quantize models to int8 or fp16 for 4-8x size reduction and 2-3x speedup","export to ONNX or TensorRT for cross-platform inference (CPU, GPU, TPU)","leverage HuggingFace Inference API for serverless deployment without infrastructure management"],"best_for":["teams deploying models to mobile or embedded systems","production systems requiring sub-100ms latency on edge devices","startups needing serverless inference without managing infrastructure","researchers benchmarking model efficiency across hardware platforms"],"limitations":["Quantization (int8) introduces 1-3% mIoU degradation due to reduced precision; requires fine-tuning on quantized models for minimal loss","ONNX export requires careful operator mapping; some custom operations (e.g., deformable convolutions) may not be supported","TensorRT optimization is GPU-specific (NVIDIA only); requires separate optimization for other hardware (Apple Neural Engine, Qualcomm Hexagon)","Edge deployment requires model size <500MB for mobile devices; Swin-Large (1.3GB) requires aggressive quantization or 
distillation","HuggingFace Inference API has cold-start latency (~2-5s) and per-inference costs; not suitable for latency-critical applications"],"requires":["PyTorch 1.9+ with export support","onnx library 1.10+ for ONNX export","TensorRT 8.0+ for GPU optimization (optional)","HuggingFace account for cloud deployment (optional)","Mobile framework (TensorFlow Lite, Core ML, NNAPI) for on-device inference"],"input_types":["PyTorch model checkpoint","quantization configuration (int8, fp16)"],"output_types":["ONNX model (.onnx)","TorchScript model (.pt)","TensorRT engine (.trt)","quantized model weights"],"categories":["automation-workflow","image-visual"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-model-facebook--mask2former-swin-large-ade-semantic__cap_8","uri":"capability://image.visual.interpretability.and.attention.visualization","name":"interpretability and attention visualization","description":"Provides attention weight maps from the Mask2Former decoder that visualize which image regions each mask query attends to during prediction. These attention maps can be overlaid on input images to understand model decisions and debug failure cases. Additionally, intermediate mask predictions from each decoder layer can be extracted to visualize iterative mask refinement. 
This enables model interpretability without external saliency methods, as attention weights directly reflect the model's spatial focus.","intents":["debug model failures by visualizing which image regions influenced predictions","understand model behavior and build trust in predictions for safety-critical applications","identify systematic biases (e.g., over-reliance on texture vs shape)","validate that the model learned meaningful features rather than spurious correlations"],"best_for":["researchers studying transformer attention mechanisms in vision","teams building safety-critical systems (medical imaging, autonomous vehicles) requiring model interpretability","developers debugging model failures on edge cases"],"limitations":["Attention weights are high-dimensional (100-200 queries x HxW spatial locations); visualization requires dimensionality reduction or aggregation","Attention weights reflect model focus but don't directly explain predictions; high attention to a region doesn't guarantee correct classification","Extracting intermediate layers adds memory overhead (~20-30% increase) and requires custom forward hooks","Attention visualization is qualitative; quantitative interpretability metrics (e.g., faithfulness, sensitivity) require additional analysis","Attention patterns are task-specific; patterns learned on ADE20K may not transfer to other domains"],"requires":["PyTorch with hook support for extracting intermediate activations","visualization libraries (matplotlib, plotly) for rendering attention maps","understanding of transformer attention mechanisms"],"input_types":["RGB images","model forward pass with attention extraction enabled"],"output_types":["attention weight maps (100-200 x H x W)","intermediate mask predictions from each decoder layer","visualizations overlaying attention on input 
images"],"categories":["image-visual","safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-model-facebook--mask2former-swin-large-ade-semantic__cap_9","uri":"capability://image.visual.panoptic.segmentation.interpretation.with.instance.grouping","name":"panoptic segmentation interpretation with instance grouping","description":"Enables panoptic-style interpretation where both semantic labels and instance grouping are available from mask predictions. Each mask query produces both a semantic class and a binary mask; masks can be grouped by class to create instance-level segmentations for 'thing' classes (e.g., separate instances of 'person' or 'car') while treating 'stuff' classes (e.g., 'wall', 'sky') as single regions. This hybrid representation combines the benefits of semantic segmentation (dense pixel labels) and instance segmentation (object-level grouping).","intents":["perform panoptic segmentation (instance + semantic) in a single forward pass without separate instance detection","count objects by grouping masks by class (e.g., 'how many people are in the image?')","enable downstream tasks requiring both semantic and instance information (e.g., scene graphs, object tracking)","support applications where instance-level understanding is needed for some classes but not others"],"best_for":["scene understanding systems requiring both semantic and instance information","robotics applications needing object counting and localization","teams building video understanding systems with instance tracking","applications requiring panoptic segmentation without separate instance detection networks"],"limitations":["Instance grouping is implicit in mask queries; no explicit instance ID assignment, requiring post-processing to group masks by class","Number of instances is limited by number of mask queries (typically 100-200); scenes with >200 objects will have missed detections","Stuff classes (wall, sky) are treated as single regions; cannot 
distinguish multiple disconnected regions of the same class without post-processing","Instance grouping requires class-specific logic; custom taxonomies need manual definition of thing vs stuff classes","No built-in instance tracking across frames; temporal consistency requires external tracking algorithms"],"requires":["semantic class definitions (thing vs stuff)","post-processing logic to group masks by class and assign instance IDs"],"input_types":["RGB images"],"output_types":["panoptic segmentation masks (HxW with instance IDs encoded as class*1000 + instance_id)","per-instance semantic labels and bounding boxes"],"categories":["image-visual","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":44,"verified":false,"data_access_risk":"low","permissions":["PyTorch 1.9+","transformers library 4.25+","CUDA 11.0+ or compatible GPU (RTX 3060 minimum recommended)","Python 3.8+","detectron2 library for inference utilities","timm library 0.6.0+ for Swin implementation","GPU with 16GB+ VRAM for fine-tuning","PyTorch 1.9+ with autograd support","detectron2 0.6+ for mask operations","CUDA 11.0+ for efficient cross-attention kernels"],"failure_modes":["Trained exclusively on ADE20K indoor/outdoor scenes — performance degrades on out-of-distribution domains (medical imaging, satellite imagery, industrial inspection)","Inference latency ~500-800ms on GPU for 1024x1024 images; CPU inference impractical for real-time applications","Memory footprint ~1.3GB for model weights; requires GPU with 8GB+ VRAM for batch processing","Fixed 150-class output space; requires retraining or adapter layers for custom semantic categories","Struggles with very small objects (<2% image area) and thin structures due to mask-based attention design","Window-attention design creates artificial boundaries at window edges; requires shifted windows to mitigate but adds complexity","Swin-Large has 196M parameters; fine-tuning requires careful learning rate 
scheduling and gradient accumulation","Feature maps at 1/32 resolution lose fine-grained spatial information; requires upsampling for pixel-accurate predictions","Positional embeddings are absolute and resolution-dependent; direct transfer to different input resolutions requires interpolation","Query-based decoding requires careful initialization; poor query initialization leads to mode collapse where multiple queries predict identical masks","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.5238019255854117,"quality":0.45,"ecosystem":0.5000000000000001,"match_graph":0.25,"freshness":0.75,"weights":{"adoption":0.35,"quality":0.2,"ecosystem":0.1,"match_graph":0.3,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-05-24T12:16:22.765Z","last_scraped_at":"2026-05-03T14:23:00.162Z","last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":119949,"model_likes":21}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=facebook--mask2former-swin-large-ade-semantic","compare_url":"https://unfragile.ai/compare?artifact=facebook--mask2former-swin-large-ade-semantic"}},"signature":"d7wwKS/AHBF14+qXo/y6Ds9O0XvCl/6KYGrYeMuhz4yjlCKJcDFVsB6RmWvtHC0po6eKKAFZVv6e20QrYfIZBw==","signedAt":"2026-06-21T09:28:21.609Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/facebook--mask2former-swin-large-ade-semantic","artifact":"https://unfragile.ai/facebook--mask2former-swin-large-ade-semantic","verify":"https://unfragile.ai/api/v1/verify?slug=facebook--mask2former-swin-large-ade-semantic","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}