{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"blip-2","slug":"blip-2","name":"BLIP-2","type":"model","url":"https://github.com/salesforce/LAVIS/tree/main/projects/blip2","page_url":"https://unfragile.ai/blip-2","categories":["model-training"],"tags":[],"pricing":{"model":"free","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"blip-2__cap_0","uri":"capability://image.visual.frozen.encoder.visual.feature.extraction.with.querying.transformer.bridging","name":"frozen-encoder visual feature extraction with querying transformer bridging","description":"BLIP-2 extracts visual features from frozen pre-trained image encoders (CLIP ViT, EVA-CLIP) without fine-tuning them, then bridges the frozen encoder output to LLM embedding space using a lightweight Querying Transformer (Q-Former) that learns task-specific visual representations. The Q-Former uses learnable query tokens that attend to frozen image features via cross-attention, enabling efficient adaptation of any frozen vision encoder to any LLM without modifying either component.","intents":["I want to leverage a frozen CLIP or EVA-CLIP encoder without retraining it while connecting it to an LLM for multimodal tasks","I need to reduce training compute by keeping vision encoders frozen and only training a lightweight adapter module","I want to compose different frozen vision encoders with different LLMs without architectural conflicts"],"best_for":["researchers building efficient vision-language models with limited compute budgets","teams wanting to reuse frozen pre-trained vision encoders across multiple LLM backends","practitioners needing rapid prototyping of multimodal systems without full model retraining"],"limitations":["frozen encoders cannot adapt to domain-specific visual patterns — performance capped by pre-training distribution","Q-Former adds ~50-100ms latency per image due to cross-attention computation over all image patches","requires careful tuning of query token count (32-256) to balance expressiveness vs computational cost","no built-in mechanism for multi-resolution image inputs — fixed input size inherited from frozen encoder"],"requires":["PyTorch 1.10.0+","Python 3.7+","pre-trained frozen image encoder (CLIP ViT-L/14, EVA-CLIP, or equivalent)","target LLM with known embedding dimension (OPT, Llama, etc.)"],"input_types":["image (RGB, 224×224 or 336×336 depending on encoder)","frozen encoder checkpoint (PyTorch .pt or .pth)"],"output_types":["visual embeddings (shape: [batch, num_queries, hidden_dim])","attention maps (Q-Former cross-attention weights)"],"categories":["image-visual","model-architecture"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"blip-2__cap_1","uri":"capability://image.visual.zero.shot.visual.question.answering.with.instruction.following","name":"zero-shot visual question answering with instruction-following","description":"BLIP-2 performs visual question answering by encoding an image through the frozen vision encoder + Q-Former, then feeding the visual embeddings as soft prompts into a frozen LLM (OPT or Llama) that generates answers in natural language. The model is trained with instruction-following objectives (e.g., 'Question: ... Answer:' templates) enabling zero-shot VQA on unseen question types without task-specific fine-tuning, leveraging the LLM's generalization capabilities.","intents":["I want to answer arbitrary questions about images without training on VQA datasets","I need to handle diverse question types (counting, reasoning, factual) with a single model","I want to generate natural language answers that follow instruction templates without task-specific heads"],"best_for":["developers building general-purpose image understanding applications","researchers evaluating zero-shot transfer of vision-language models","teams needing flexible VQA without dataset-specific fine-tuning"],"limitations":["zero-shot performance degrades on complex reasoning questions requiring multi-step logic","LLM generation can hallucinate plausible-sounding but incorrect answers due to limited visual grounding","inference latency ~1-3 seconds per image due to autoregressive LLM decoding","no built-in confidence scores or uncertainty quantification for answer reliability","performance bounded by frozen LLM's knowledge cutoff and instruction-following ability"],"requires":["PyTorch 1.10.0+","Python 3.7+","pre-trained BLIP-2 checkpoint (Q-Former + frozen encoder + frozen LLM)","image input (224×224 or 336×336 RGB)","optional: custom instruction templates for domain-specific prompting"],"input_types":["image (RGB, fixed resolution)","question text (natural language string)","optional: instruction template (e.g., 'Question: {q} Answer:')"],"output_types":["answer text (natural language string, variable length)","token-level logits (for confidence estimation)"],"categories":["image-visual","text-generation-language"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"blip-2__cap_10","uri":"capability://automation.workflow.efficient.inference.with.quantization.and.model.compression.support","name":"efficient inference with quantization and model compression support","description":"BLIP-2 supports inference optimization through integration with quantization frameworks (e.g., INT8 quantization via PyTorch) and model compression techniques that reduce memory footprint and latency. The frozen encoder and Q-Former can be quantized independently, and the frozen LLM can use existing LLM quantization methods (e.g., GPTQ, AWQ), enabling deployment on resource-constrained devices without full model fine-tuning.","intents":["I want to deploy BLIP-2 on edge devices or mobile with reduced memory footprint","I need to reduce inference latency for real-time applications (video processing, live chat)","I want to quantize the model without retraining or fine-tuning"],"best_for":["teams deploying BLIP-2 on edge devices (mobile, embedded systems)","practitioners needing real-time inference with latency constraints","developers optimizing inference cost in cloud deployments"],"limitations":["quantization typically reduces accuracy by 2-5% depending on quantization bit-width","INT8 quantization may not be supported on all hardware (requires specific GPU/CPU support)","no built-in quantization-aware training — post-training quantization may be suboptimal","quantized models are not compatible with standard PyTorch checkpoints — require custom loading","quantization benefits vary by component — frozen encoder quantization may have different accuracy impact than Q-Former"],"requires":["PyTorch 1.10.0+","Python 3.7+","quantization framework (e.g., PyTorch native quantization, GPTQ, AWQ)","optional: calibration dataset for post-training quantization","hardware supporting target quantization format (INT8, FP16, etc.)"],"input_types":["pre-trained BLIP-2 checkpoint (full precision)","optional: calibration images for quantization calibration"],"output_types":["quantized model checkpoint (reduced precision, smaller file size)","quantization metadata (scale factors, zero points for INT8)"],"categories":["automation-workflow","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"blip-2__cap_2","uri":"capability://image.visual.image.captioning.with.controlled.generation.length.and.style","name":"image captioning with controlled generation length and style","description":"BLIP-2 generates image captions by encoding images through the frozen vision encoder + Q-Former, then using the frozen LLM in generation mode with instruction prompts (e.g., 'A short description:' or 'A detailed description:') to control caption length and style. The model leverages the LLM's text generation capabilities with beam search or nucleus sampling to produce diverse captions from the same image without task-specific caption decoders.","intents":["I want to generate captions for images with controllable length (short vs detailed)","I need diverse caption variations from a single image for data augmentation","I want to caption images without training on caption datasets"],"best_for":["content creators needing automated image descriptions at scale","researchers evaluating caption quality across different instruction styles","teams building accessibility features (alt-text generation) without dataset-specific training"],"limitations":["captions often describe obvious visual content rather than providing novel insights","no explicit control over caption attributes (e.g., 'mention colors' or 'focus on objects')","generation quality depends heavily on instruction prompt engineering","inference latency ~1-2 seconds per image due to autoregressive decoding","hallucination risk: model may describe objects not present in image"],"requires":["PyTorch 1.10.0+","Python 3.7+","pre-trained BLIP-2 checkpoint","image input (224×224 or 336×336 RGB)","optional: custom instruction prompts for style control"],"input_types":["image (RGB, fixed resolution)","optional: instruction prompt (e.g., 'A short description:', 'A detailed description:')"],"output_types":["caption text (natural language string, 10-100 tokens typical)","generation scores (log-probability per caption for ranking)"],"categories":["image-visual","text-generation-language"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"blip-2__cap_3","uri":"capability://image.visual.multimodal.feature.extraction.for.downstream.tasks.via.unified.interface","name":"multimodal feature extraction for downstream tasks via unified interface","description":"BLIP-2 exposes a unified feature extraction interface (via LAVIS's load_model_and_preprocess() and model.extract_features() methods) that returns visual embeddings from the Q-Former output, enabling use of BLIP-2 as a feature extractor for image retrieval, classification, or clustering tasks. The extracted features are task-agnostic embeddings that can be fed to lightweight downstream classifiers or similarity metrics without full model fine-tuning.","intents":["I want to extract visual features from BLIP-2 for image retrieval or similarity search","I need to use BLIP-2 as a feature extractor for downstream classification tasks","I want to compare BLIP-2 features with other vision models (CLIP, ALBEF) using a consistent interface"],"best_for":["researchers benchmarking feature quality across different vision-language models","teams building image retrieval systems with pre-extracted embeddings","practitioners needing to extract features once and reuse them for multiple downstream tasks"],"limitations":["extracted features are task-agnostic and may not be optimal for specific downstream tasks","feature dimensionality fixed by Q-Former hidden size (256-768 depending on variant)","no built-in normalization or dimensionality reduction — downstream tasks must handle feature scaling","feature extraction requires full forward pass through frozen encoder + Q-Former (~100-200ms per image)"],"requires":["PyTorch 1.10.0+","Python 3.7+","LAVIS library installed (pip install salesforce-lavis)","pre-trained BLIP-2 checkpoint","image input (224×224 or 336×336 RGB)"],"input_types":["image (RGB, fixed resolution)","optional: batch of images (for efficient batch processing)"],"output_types":["visual embeddings (shape: [batch, num_queries, hidden_dim], e.g., [1, 32, 256])","optional: attention weights (Q-Former cross-attention maps)"],"categories":["image-visual","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"blip-2__cap_4","uri":"capability://tool.use.integration.registry.based.model.composition.and.dynamic.loading","name":"registry-based model composition and dynamic loading","description":"BLIP-2 integrates with LAVIS's registry-based architecture (via load_model_and_preprocess() function) enabling dynamic model loading by name, automatic checkpoint downloading, and composition of different frozen encoders with different LLMs without code changes. The registry system maps model names (e.g., 'blip2_opt', 'blip2_llama') to configurations that specify encoder type, LLM type, and Q-Former parameters, enabling users to swap components via configuration files.","intents":["I want to load different BLIP-2 variants (OPT, Llama) with a single function call","I need to automatically download pre-trained checkpoints without manual URL handling","I want to experiment with different encoder-LLM combinations by changing config files, not code"],"best_for":["researchers rapidly prototyping different model configurations","teams deploying multiple BLIP-2 variants in production with centralized config management","developers building model selection logic that needs to support multiple architectures"],"limitations":["registry-based loading adds ~500ms-1s overhead for model initialization and checkpoint download","custom model variants require registering new config files in LAVIS codebase or external registry","no built-in model versioning — checkpoint URLs must be manually updated when new versions release","registry system couples model selection to LAVIS library — difficult to use custom model variants outside LAVIS"],"requires":["PyTorch 1.10.0+","Python 3.7+","LAVIS library installed (pip install salesforce-lavis)","internet connection for checkpoint downloading (or pre-cached checkpoints)","optional: custom config YAML files for non-standard variants"],"input_types":["model name string (e.g., 'blip2_opt', 'blip2_llama')","optional: model_type parameter (e.g., 'pretrain', 'vqa', 'caption')","optional: device specification (e.g., 'cuda:0', 'cpu')"],"output_types":["loaded model instance (nn.Module with forward() method)","preprocessor object (handles image/text normalization)"],"categories":["tool-use-integration","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"blip-2__cap_5","uri":"capability://data.processing.analysis.batch.image.preprocessing.with.automatic.normalization.and.resizing","name":"batch image preprocessing with automatic normalization and resizing","description":"BLIP-2 provides preprocessor objects (via LAVIS's load_model_and_preprocess() function) that handle image resizing, normalization, and batching according to the frozen encoder's requirements (e.g., CLIP ViT expects 224×224 with ImageNet normalization). The preprocessor applies these transformations consistently across images and returns PyTorch tensors ready for model inference, abstracting away encoder-specific preprocessing details.","intents":["I want to preprocess images consistently with the frozen encoder's requirements without manual normalization","I need to batch process multiple images with automatic resizing and padding","I want to avoid preprocessing bugs by using encoder-aware preprocessing instead of manual transforms"],"best_for":["developers building inference pipelines that need consistent image preprocessing","teams processing diverse image sizes and formats without manual transform logic","practitioners avoiding preprocessing bugs by delegating to encoder-aware preprocessors"],"limitations":["preprocessor is tied to specific frozen encoder — swapping encoders requires new preprocessor","fixed input resolution (224×224 or 336×336) may lose detail in high-resolution images","no built-in support for multi-resolution inputs or dynamic batching","preprocessing adds ~10-50ms latency per image depending on image size and format"],"requires":["PyTorch 1.10.0+","Python 3.7+","torchvision library (for image transforms)","LAVIS library installed","image input (PIL Image, numpy array, or file path)"],"input_types":["image (PIL Image, numpy array, or file path string)","optional: batch of images (list or tensor)"],"output_types":["preprocessed image tensor (shape: [batch, 3, height, width], float32)","optional: image metadata (original size, padding info)"],"categories":["data-processing-analysis","image-visual"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"blip-2__cap_6","uri":"capability://automation.workflow.multi.task.training.with.unified.loss.functions.and.evaluation.metrics","name":"multi-task training with unified loss functions and evaluation metrics","description":"BLIP-2 supports training on multiple vision-language tasks (VQA, captioning, retrieval, classification) using a unified training pipeline (via LAVIS's Runner system) that applies task-specific loss functions (contrastive loss for retrieval, cross-entropy for VQA, language modeling loss for captioning) while sharing the frozen encoder and Q-Former backbone. The training system automatically selects appropriate loss functions and evaluation metrics based on task configuration, enabling multi-task learning without task-specific training code.","intents":["I want to train BLIP-2 on multiple tasks (VQA + captioning) simultaneously to improve generalization","I need to evaluate model performance on multiple benchmarks (VQA-v2, COCO Captions, Flickr30K) with consistent metrics","I want to leverage multi-task learning to improve zero-shot transfer without task-specific fine-tuning"],"best_for":["researchers exploring multi-task learning for vision-language models","teams training BLIP-2 variants on custom datasets with multiple task objectives","practitioners wanting to improve zero-shot performance through multi-task pre-training"],"limitations":["multi-task training requires careful loss weighting to balance task objectives — poor weighting degrades all tasks","training time increases linearly with number of tasks (e.g., 3 tasks = ~3x training time)","no built-in automatic loss weighting — requires manual tuning of task weights","evaluation on multiple tasks requires multiple datasets and metric implementations","frozen encoder limits task-specific visual adaptation — performance capped by encoder pre-training"],"requires":["PyTorch 1.10.0+","Python 3.7+","LAVIS library installed","multiple datasets (e.g., VQA-v2, COCO Captions, Flickr30K)","GPU with 16GB+ VRAM for multi-task training","task-specific configuration YAML files"],"input_types":["image (RGB, 224×224 or 336×336)","task-specific labels (questions+answers for VQA, captions for captioning, etc.)","task configuration (YAML specifying loss weights, metrics, datasets)"],"output_types":["trained Q-Former checkpoint (frozen encoder + LLM unchanged)","evaluation metrics per task (BLEU, CIDEr for captioning; Accuracy for VQA; Recall@K for retrieval)"],"categories":["automation-workflow","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"blip-2__cap_7","uri":"capability://data.processing.analysis.dataset.loading.and.automatic.downloading.with.unified.data.interface","name":"dataset loading and automatic downloading with unified data interface","description":"BLIP-2 integrates with LAVIS's dataset system (via load_dataset() function) that provides unified access to 20+ vision-language datasets (COCO, Flickr30K, Visual Genome, VQA-v2, etc.) with automatic downloading, caching, and annotation parsing. The dataset loader returns standardized data dictionaries with image paths, captions, questions, answers, etc., abstracting away dataset-specific format differences and enabling easy dataset switching for training and evaluation.","intents":["I want to load standard vision-language datasets without manually downloading and parsing annotations","I need to switch between datasets (COCO, Flickr30K) without changing data loading code","I want to access multiple splits (train, val, test) with consistent interfaces"],"best_for":["researchers benchmarking BLIP-2 on standard datasets without dataset-specific preprocessing","teams training on multiple datasets sequentially or in multi-task settings","practitioners avoiding dataset format bugs by using standardized data loaders"],"limitations":["automatic downloading requires significant disk space (COCO ~20GB, Visual Genome ~100GB+)","dataset loading adds ~1-5 seconds overhead per epoch due to annotation parsing","custom datasets require manual registration in LAVIS dataset registry","no built-in support for streaming datasets or on-the-fly augmentation beyond image preprocessing","dataset splits are fixed — no built-in support for custom train/val/test splits"],"requires":["PyTorch 1.10.0+","Python 3.7+","LAVIS library installed","internet connection for dataset downloading","sufficient disk space (20GB-100GB+ depending on datasets)","optional: custom dataset configuration YAML files"],"input_types":["dataset name string (e.g., 'coco_caption', 'vqa_v2', 'flickr30k')","optional: split parameter (e.g., 'train', 'val', 'test')","optional: dataset configuration (YAML specifying paths, splits, annotations)"],"output_types":["dataset object (iterable returning data dictionaries)","data dictionary (keys: 'image', 'caption'/'question'/'answer', 'image_id', etc.)","optional: dataset metadata (size, splits, annotation format)"],"categories":["data-processing-analysis","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"blip-2__cap_8","uri":"capability://image.visual.instruction.tuned.visual.reasoning.with.in.context.learning","name":"instruction-tuned visual reasoning with in-context learning","description":"BLIP-2 (via InstructBLIP variant) supports instruction-tuned visual reasoning where the model receives natural language instructions (e.g., 'Describe the objects in the image', 'Count the red objects') and generates responses following those instructions. The model leverages the frozen LLM's instruction-following capabilities and in-context learning (few-shot examples in the prompt) to adapt to new reasoning tasks without fine-tuning, enabling zero-shot generalization to unseen instruction types.","intents":["I want to perform diverse visual reasoning tasks (counting, localization, description) with natural language instructions","I need to adapt the model to new reasoning tasks using few-shot examples in the prompt","I want to leverage instruction-following without task-specific fine-tuning"],"best_for":["researchers exploring instruction-tuned vision-language models","teams building flexible visual reasoning systems that handle diverse task types","practitioners needing zero-shot adaptation to new reasoning tasks via prompting"],"limitations":["instruction-following quality depends heavily on instruction clarity and LLM's instruction-following ability","in-context learning requires careful prompt engineering — poor examples degrade performance","no explicit grounding mechanism — model may hallucinate answers without visual grounding","inference latency ~2-4 seconds per image due to LLM generation with long context","performance on complex reasoning (multi-step logic, spatial relationships) remains limited"],"requires":["PyTorch 1.10.0+","Python 3.7+","InstructBLIP checkpoint (instruction-tuned variant)","image input (224×224 or 336×336 RGB)","natural language instruction string","optional: few-shot examples for in-context learning"],"input_types":["image (RGB, fixed resolution)","instruction text (natural language string, e.g., 'Describe the objects in the image')","optional: few-shot examples (list of (instruction, response) pairs)"],"output_types":["response text (natural language string following instruction)","optional: confidence scores or uncertainty estimates"],"categories":["image-visual","text-generation-language","planning-reasoning"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"blip-2__cap_9","uri":"capability://image.visual.cross.modal.retrieval.with.contrastive.learning.embeddings","name":"cross-modal retrieval with contrastive learning embeddings","description":"BLIP-2 supports image-text retrieval by training visual and text embeddings in a shared space using contrastive loss (InfoNCE), enabling similarity-based matching between images and text descriptions. The model encodes images through the frozen encoder + Q-Former and text through a frozen text encoder (e.g., BERT), then computes similarity scores via dot product in the shared embedding space, enabling both image-to-text and text-to-image retrieval without task-specific ranking heads.","intents":["I want to retrieve images matching text queries or vice versa using learned similarity metrics","I need to build image-text retrieval systems without training task-specific ranking models","I want to leverage contrastive learning to align visual and textual representations"],"best_for":["teams building image search systems with text queries","researchers evaluating cross-modal alignment in vision-language models","practitioners needing efficient retrieval without ranking networks"],"limitations":["contrastive learning requires large batch sizes (256+) for effective negative sampling — small batches degrade performance","retrieval quality depends on text description quality — poor captions hurt alignment","no explicit ranking mechanism — similarity scores are raw dot products without learned ranking","inference requires computing similarity for all gallery items — O(n) complexity for n items","frozen text encoder limits text-specific adaptation — performance capped by pre-training"],"requires":["PyTorch 1.10.0+","Python 3.7+","BLIP-2 checkpoint trained with contrastive loss","image input (224×224 or 336×336 RGB)","text input (caption or description string)","optional: pre-computed image/text embeddings for efficient retrieval"],"input_types":["image (RGB, fixed resolution) or image embeddings (pre-computed)","text (caption string) or text embeddings (pre-computed)"],"output_types":["similarity score (float, typically 0-1 after softmax)","ranked list of matches (image-text pairs sorted by similarity)","optional: embedding vectors for downstream similarity computation"],"categories":["image-visual","search-retrieval","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"blip-2__headline","uri":"capability://model.training.multimodal.vision.language.model","name":"multimodal vision-language model","description":"BLIP-2 is a state-of-the-art multimodal vision-language model that enables efficient visual question answering, image captioning, and multimodal reasoning, bridging image encoders with large language models.","intents":["best multimodal model","multimodal model for visual question answering","image captioning model","how to use BLIP-2 for image tasks","top vision-language models for research"],"best_for":["researchers in AI","developers in computer vision"],"limitations":["requires understanding of deep learning"],"requires":["Python 3.7+","PyTorch 1.10.0+"],"input_types":["images","text"],"output_types":["text","captions","answers"],"categories":["model-training"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":57,"verified":false,"data_access_risk":"low","permissions":["PyTorch 1.10.0+","Python 3.7+","pre-trained frozen image encoder (CLIP ViT-L/14, EVA-CLIP, or equivalent)","target LLM with known embedding dimension (OPT, Llama, etc.)","pre-trained BLIP-2 checkpoint (Q-Former + frozen encoder + frozen LLM)","image input (224×224 or 336×336 RGB)","optional: custom instruction templates for domain-specific prompting","quantization framework (e.g., PyTorch native quantization, GPTQ, AWQ)","optional: calibration dataset for post-training quantization","hardware supporting target quantization format (INT8, FP16, etc.)"],"failure_modes":["frozen encoders cannot adapt to domain-specific visual patterns — performance capped by pre-training distribution","Q-Former adds ~50-100ms latency per image due to cross-attention computation over all image patches","requires careful tuning of query token count (32-256) to balance expressiveness vs computational cost","no built-in mechanism for multi-resolution image inputs — fixed input size inherited from frozen encoder","zero-shot performance degrades on complex reasoning questions requiring multi-step logic","LLM generation can hallucinate plausible-sounding but incorrect answers due to limited visual grounding","inference latency ~1-3 seconds per image due to autoregressive LLM decoding","no built-in confidence scores or uncertainty quantification for answer reliability","performance bounded by frozen LLM's knowledge cutoff and instruction-following ability","quantization typically reduces accuracy by 2-5% depending on quantization bit-width","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.7,"quality":0.9,"ecosystem":0.39999999999999997,"match_graph":0.25,"freshness":0.52,"weights":{"adoption":0.35,"quality":0.2,"ecosystem":0.1,"match_graph":0.3,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-06-17T09:51:04.690Z","last_scraped_at":null,"last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=blip-2","compare_url":"https://unfragile.ai/compare?artifact=blip-2"}},"signature":"1jmwGNGMl4YcjhBl6g2tkIgWEHMA8IgN82a9+S6wWRj13oRx2ZoO7y6L5eAto53AhhBgctwZXAAsN7mYFVt1AQ==","signedAt":"2026-06-22T03:53:41.606Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/blip-2","artifact":"https://unfragile.ai/blip-2","verify":"https://unfragile.ai/api/v1/verify?slug=blip-2","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}