{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"openrouter-meta-llama-llama-guard-4-12b","slug":"meta-llama-llama-guard-4-12b","name":"Meta: Llama Guard 4 12B","type":"model","url":"https://openrouter.ai/models/meta-llama~llama-guard-4-12b","page_url":"https://unfragile.ai/meta-llama-llama-guard-4-12b","categories":["model-training"],"tags":["meta-llama","api-access","text","image"],"pricing":{"model":"paid","free":false,"starting_price":"$1.80e-7 per prompt token"},"status":"active","verified":false},"capabilities":[{"id":"openrouter-meta-llama-llama-guard-4-12b__cap_0","uri":"capability://safety.moderation.multimodal.content.safety.classification","name":"multimodal content safety classification","description":"Classifies both text and image inputs against a taxonomy of unsafe content categories (violence, sexual content, hate speech, etc.) using a fine-tuned Llama 4 Scout backbone with multimodal encoders. The model processes inputs through separate text and vision pathways, then aggregates representations to produce safety risk scores and category labels. Built on instruction-tuned safety classification patterns established in Llama Guard 3, extended with visual understanding for detecting unsafe imagery.","intents":["I need to filter user-generated content (text and images) before it reaches my LLM application","I want to classify incoming prompts and responses for safety violations before deployment","I need to audit image and text datasets for unsafe content at scale","I want to implement content moderation guardrails that understand both language and visual context"],"best_for":["LLM application builders implementing safety layers before inference","Content moderation teams processing mixed-media user submissions","Enterprise teams building compliant AI systems with audit trails","Researchers evaluating safety properties of multimodal datasets"],"limitations":["Classification taxonomy is fixed to Meta's predefined categories — cannot customize safety definitions without retraining","Multimodal processing adds ~500-800ms latency per request vs text-only classifiers due to vision encoding","No explanation/reasoning output — returns only category labels and confidence scores without justification","Requires API calls through OpenRouter or self-hosted deployment; no local quantized versions documented","May have lower accuracy on edge-case content or domain-specific unsafe patterns (e.g., financial fraud, medical misinformation)"],"requires":["OpenRouter API key or self-hosted inference infrastructure (vLLM, TGI, or similar)","Input text up to model context length (likely 8K tokens based on Llama 4 Scout specs)","Images in standard formats (JPEG, PNG, WebP) with reasonable resolution (tested on 224-1024px)","Network connectivity for API calls or GPU memory (12B model requires ~24-28GB VRAM for inference)"],"input_types":["text (prompts, responses, user messages)","image (PNG, JPEG, WebP formats)","mixed (text + image pairs for joint classification)"],"output_types":["structured JSON with category labels and confidence scores","risk severity level (safe/low-risk/high-risk)","category breakdown (e.g., {violence: 0.92, sexual: 0.15, hate_speech: 0.03})"],"categories":["safety-moderation","image-visual"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"openrouter-meta-llama-llama-guard-4-12b__cap_1","uri":"capability://safety.moderation.taxonomy.based.unsafe.content.categorization","name":"taxonomy-based unsafe content categorization","description":"Maps input content to a predefined taxonomy of unsafe categories (violence, sexual content, hate speech, illegal activities, etc.) using instruction-tuned classification. The model was fine-tuned on safety-labeled datasets to recognize nuanced violations within each category, producing granular category-level confidence scores rather than binary safe/unsafe decisions. Supports hierarchical reasoning about content severity across multiple harm dimensions simultaneously.","intents":["I need to know which specific safety categories a piece of content violates, not just whether it's safe","I want to apply different policies to different violation types (e.g., quarantine violence but reject sexual content)","I need to generate audit logs showing exactly which safety rules were triggered","I want to fine-tune my moderation thresholds per category based on my application's risk tolerance"],"best_for":["Moderation teams needing detailed violation reports for human review","Applications with category-specific policies (e.g., stricter on violence, lenient on political speech)","Compliance-heavy industries (fintech, healthcare) requiring audit trails of safety decisions","Researchers studying safety taxonomy effectiveness across domains"],"limitations":["Taxonomy is fixed and opaque — no way to add custom categories or reweight existing ones without retraining","Category definitions may not align with your application's specific safety policies","Confidence scores are relative, not calibrated probabilities — threshold selection requires empirical tuning","No explanation of which phrases/regions triggered each category — only aggregate scores per category"],"requires":["Understanding of Meta's safety taxonomy (documentation should specify exact categories)","Threshold tuning process to determine acceptable confidence levels per category","API integration layer to map model outputs to your application's policy decisions"],"input_types":["text (full prompts, user messages, generated responses)","image (visual content to classify against violence, sexual, hate speech categories)"],"output_types":["category scores object (e.g., {violence: 0.87, sexual_content: 0.12, hate_speech: 0.04, illegal_activity: 0.01})","category labels (list of triggered categories above threshold)","severity level per category (optional, if model outputs graduated risk levels)"],"categories":["safety-moderation","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"openrouter-meta-llama-llama-guard-4-12b__cap_2","uri":"capability://safety.moderation.instruction.tuned.safety.reasoning","name":"instruction-tuned safety reasoning","description":"Applies instruction-following capabilities from the Llama 4 Scout base model to safety classification tasks, enabling the model to understand nuanced safety instructions and apply them consistently. The fine-tuning process teaches the model to reason about context, intent, and harm potential rather than matching keywords. This allows classification of subtle violations (e.g., veiled threats, coded hate speech) that simple pattern matching would miss.","intents":["I need to detect sophisticated or obfuscated unsafe content that uses indirect language or coded references","I want the safety classifier to understand context and intent, not just flag keywords","I need consistent safety decisions across paraphrases and variations of the same harmful content","I want to reduce false positives from benign uses of potentially sensitive terms"],"best_for":["Applications with sophisticated users who might attempt to evade keyword-based filters","Domains requiring contextual understanding (e.g., medical discussions vs. self-harm content)","Teams building safety systems that need to handle edge cases and ambiguous content","Multilingual or cross-cultural applications where keyword filtering fails"],"limitations":["Instruction-tuned models can be adversarially prompted or jailbroken — no guarantee against determined evasion","Reasoning capability adds latency (~500ms+) compared to lightweight pattern matchers","Fine-tuning data biases may cause inconsistent classification across demographic groups or cultural contexts","No interpretability into which aspects of the input triggered the classification decision"],"requires":["Acceptance that no classifier is perfect — requires human review for edge cases","Baseline understanding of your domain's safety concerns to set appropriate thresholds","Monitoring and feedback loops to detect and correct systematic biases"],"input_types":["text with contextual information (full conversation, user profile, metadata)","image with surrounding text context"],"output_types":["safety classification with confidence scores","category labels reflecting reasoned judgment about intent and harm"],"categories":["safety-moderation","planning-reasoning"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"openrouter-meta-llama-llama-guard-4-12b__cap_3","uri":"capability://safety.moderation.batch.content.moderation.via.api","name":"batch content moderation via api","description":"Exposes safety classification through OpenRouter's API, enabling batch processing of content at scale without managing inference infrastructure. Requests are routed through OpenRouter's load-balanced endpoints, supporting concurrent classification of multiple text/image inputs. The API abstracts away model serving complexity, providing a simple HTTP interface with standard request/response formats.","intents":["I want to moderate user-generated content in real-time without running my own GPU infrastructure","I need to process large batches of historical content for compliance audits","I want to integrate safety classification into my application's request pipeline with minimal engineering overhead","I need to scale moderation without managing model deployment, versioning, or updates"],"best_for":["Startups and small teams without ML infrastructure expertise","Applications with variable moderation load (spiky traffic patterns)","Teams prioritizing time-to-market over cost optimization","Organizations in regulated industries requiring audit trails and SLA guarantees"],"limitations":["API latency (~500-1000ms per request) makes real-time moderation of high-frequency streams challenging","Per-request pricing adds up quickly at scale — batch processing is cheaper but introduces latency","Vendor lock-in to OpenRouter's infrastructure and pricing model","No local fallback if API is unavailable — requires graceful degradation logic in your application","Rate limits may throttle burst traffic during traffic spikes"],"requires":["OpenRouter API key with active billing","HTTP client library (any language)","Network connectivity and reasonable latency tolerance (~1s per request)","Error handling for API failures, rate limits, and timeouts"],"input_types":["text (JSON payload with text field)","image (base64-encoded or URL reference)","mixed (text + image in single request)"],"output_types":["JSON response with category scores and risk level","HTTP status codes (200 for success, 429 for rate limit, 5xx for server errors)"],"categories":["safety-moderation","tool-use-integration"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"openrouter-meta-llama-llama-guard-4-12b__cap_4","uri":"capability://safety.moderation.image.safety.classification.with.visual.understanding","name":"image safety classification with visual understanding","description":"Processes images through a vision encoder integrated into the Llama 4 Scout backbone to detect unsafe visual content (violence, sexual imagery, hate symbols, etc.). The vision pathway extracts visual features that are then fused with text embeddings for joint classification. This enables detection of unsafe imagery even without accompanying text, and allows the model to understand visual context when classifying text+image pairs together.","intents":["I need to filter user-uploaded images for violence, sexual content, and hate symbols before they're stored or displayed","I want to detect unsafe imagery in social media feeds or user galleries at scale","I need to understand whether an image+caption pair together constitute unsafe content (e.g., hateful meme)","I want to audit image datasets for unsafe visual patterns"],"best_for":["Social media platforms and content-sharing applications","E-commerce platforms moderating user-generated product images","Content moderation teams handling mixed-media submissions","Researchers studying visual safety and bias in image classifiers"],"limitations":["Vision encoder adds ~300-500ms latency per image compared to text-only classification","Accuracy varies with image quality, resolution, and artistic style — may struggle with abstract or stylized content","No bounding box or region highlighting — only image-level safety scores, not localization of unsafe regions","May have dataset bias toward certain demographics or cultural contexts in training data","Cannot detect context-dependent harms (e.g., image of a weapon in a historical museum vs. threat context)"],"requires":["Images in standard formats (JPEG, PNG, WebP)","Reasonable image resolution (tested on 224-1024px; very small or very large images may degrade accuracy)","GPU inference infrastructure or API access (image processing is compute-intensive)"],"input_types":["image (JPEG, PNG, WebP formats)","image + text (for joint classification of memes, captioned images, etc.)"],"output_types":["image-level safety scores (violence, sexual_content, hate_speech, etc.)","risk level (safe/low-risk/high-risk)","category labels for triggered violations"],"categories":["safety-moderation","image-visual"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":23,"verified":false,"data_access_risk":"low","permissions":["OpenRouter API key or self-hosted inference infrastructure (vLLM, TGI, or similar)","Input text up to model context length (likely 8K tokens based on Llama 4 Scout specs)","Images in standard formats (JPEG, PNG, WebP) with reasonable resolution (tested on 224-1024px)","Network connectivity for API calls or GPU memory (12B model requires ~24-28GB VRAM for inference)","Understanding of Meta's safety taxonomy (documentation should specify exact categories)","Threshold tuning process to determine acceptable confidence levels per category","API integration layer to map model outputs to your application's policy decisions","Acceptance that no classifier is perfect — requires human review for edge cases","Baseline understanding of your domain's safety concerns to set appropriate thresholds","Monitoring and feedback loops to detect and correct systematic biases"],"failure_modes":["Classification taxonomy is fixed to Meta's predefined categories — cannot customize safety definitions without retraining","Multimodal processing adds ~500-800ms latency per request vs text-only classifiers due to vision encoding","No explanation/reasoning output — returns only category labels and confidence scores without justification","Requires API calls through OpenRouter or self-hosted deployment; no local quantized versions documented","May have lower accuracy on edge-case content or domain-specific unsafe patterns (e.g., financial fraud, medical misinformation)","Taxonomy is fixed and opaque — no way to add custom categories or reweight existing ones without retraining","Category definitions may not align with your application's specific safety policies","Confidence scores are relative, not calibrated probabilities — threshold selection requires empirical tuning","No explanation of which phrases/regions triggered each category — only aggregate scores per category","Instruction-tuned models can be adversarially prompted or jailbroken — no guarantee against determined evasion","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.05,"quality":0.35,"ecosystem":0.27,"match_graph":0.25,"freshness":0.75,"weights":{"adoption":0.35,"quality":0.2,"ecosystem":0.1,"match_graph":0.3,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-05-24T12:16:24.484Z","last_scraped_at":"2026-05-03T15:20:45.776Z","last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=meta-llama-llama-guard-4-12b","compare_url":"https://unfragile.ai/compare?artifact=meta-llama-llama-guard-4-12b"}},"signature":"3GfXiT6Bsrc3RM81acC2kOpiy37YrVNCOlCX7vflce51XW4tXpGJkGuqWkG6dL+x+mxK92yVnllQcHtMpWYGDA==","signedAt":"2026-06-23T00:35:52.752Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/meta-llama-llama-guard-4-12b","artifact":"https://unfragile.ai/meta-llama-llama-guard-4-12b","verify":"https://unfragile.ai/api/v1/verify?slug=meta-llama-llama-guard-4-12b","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}