{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"prompt-guard","slug":"prompt-guard","name":"Prompt Guard","type":"model","url":"https://github.com/meta-llama/PurpleLlama","page_url":"https://unfragile.ai/prompt-guard","categories":["testing-quality"],"tags":[],"pricing":{"model":"free","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"prompt-guard__cap_0","uri":"capability://safety.moderation.binary.prompt.injection.classification.with.transformer.based.detection","name":"binary prompt injection classification with transformer-based detection","description":"Prompt Guard implements a lightweight transformer-based binary classifier that analyzes input text to detect prompt injection and jailbreak attempts before they reach the target LLM. The model uses a fine-tuned encoder architecture trained on adversarial prompt datasets to distinguish between benign user inputs and malicious injection patterns, operating as a preprocessing filter that can be deployed independently of the underlying LLM provider.","intents":["I need to filter user inputs for prompt injection attacks before sending them to my LLM API","I want to add a security layer that detects jailbreak attempts in real-time without modifying my LLM","I need to understand whether an input contains adversarial patterns designed to manipulate my model"],"best_for":["LLM application developers building production systems with untrusted user inputs","Teams deploying multi-tenant LLM services requiring input validation","Security-conscious organizations implementing defense-in-depth for generative AI"],"limitations":["Binary classification only — returns true/false without confidence scores or attack type categorization","Trained primarily on English prompt injection patterns; multilingual coverage may be limited","Cannot detect novel zero-day injection techniques not represented in training data","Requires integration into request pipeline; no built-in rate limiting or logging"],"requires":["PyTorch or compatible inference runtime","Model weights downloaded from Meta's model hub (approximately 1-2GB)","Python 3.8+ for inference wrapper","Sufficient GPU memory or CPU for real-time inference (model size ~1B parameters)"],"input_types":["text (raw user prompts, chat messages, API requests)"],"output_types":["boolean (injection detected: true/false)","optional: confidence score or logits for threshold tuning"],"categories":["safety-moderation","security-filtering"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"prompt-guard__cap_1","uri":"capability://safety.moderation.multilingual.prompt.injection.detection.with.machine.translated.adversarial.datasets","name":"multilingual prompt injection detection with machine-translated adversarial datasets","description":"Prompt Guard extends injection detection across multiple languages by leveraging machine-translated versions of adversarial prompt datasets from the CyberSecEval benchmarks. The model processes non-English inputs through the same transformer encoder, enabling detection of injection attempts crafted in languages other than English without requiring separate language-specific models or retraining.","intents":["I operate a global LLM service and need to detect prompt injections in multiple languages","I want to ensure my security filtering works for non-English users without deploying separate models","I need to validate that adversarial patterns translate consistently across languages"],"best_for":["International SaaS platforms serving multilingual user bases","Organizations with compliance requirements for non-English markets","Teams evaluating cross-lingual robustness of security measures"],"limitations":["Multilingual coverage depends on machine translation quality; semantic drift in translation may reduce detection accuracy","No explicit language identification — model processes all inputs with same weights regardless of language","Training data is machine-translated, not native-speaker crafted, potentially missing language-specific attack patterns","Performance may degrade for low-resource languages or code-mixed inputs"],"requires":["Same model weights as English version (no separate language models)","Optional: language detection preprocessing to log or route by language","Understanding that multilingual training is via translation, not native data"],"input_types":["text in multiple languages (English, Spanish, French, German, Chinese, Japanese, etc.)"],"output_types":["boolean (injection detected: true/false)","optional: language tag for logging/analysis"],"categories":["safety-moderation","security-filtering"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"prompt-guard__cap_2","uri":"capability://safety.moderation.integration.with.llamafirewall.scanner.pipeline.for.layered.defense","name":"integration with llamafirewall scanner pipeline for layered defense","description":"Prompt Guard operates as a component within the broader LlamaFirewall security framework, which orchestrates multiple scanner modules (including Prompt Guard, Llama Guard for output filtering, and CodeShield for code-specific threats) into a coordinated defense pipeline. The architecture allows Prompt Guard to be deployed as the first-stage input filter, with results passed to downstream scanners for comprehensive threat assessment across the full LLM interaction lifecycle.","intents":["I want to deploy Prompt Guard as part of a comprehensive security stack, not in isolation","I need to understand how prompt injection detection fits into broader LLM safeguarding","I want to coordinate input filtering with output filtering and code-specific threat detection"],"best_for":["Enterprise teams building production LLM systems with multiple security layers","Organizations implementing defense-in-depth strategies across input/output/code execution","Teams using the full Purple Llama ecosystem for end-to-end security"],"limitations":["Requires understanding of LlamaFirewall architecture and scanner composition","Integration adds orchestration overhead; no built-in performance optimization across scanners","Scanner pipeline is sequential by default; parallel execution requires custom implementation","Coordination logic between scanners (e.g., early exit on high-confidence injection) must be implemented by user"],"requires":["LlamaFirewall framework installed and configured","Understanding of scanner component interfaces and data flow","Configuration of scanner ordering and decision logic","Potentially: custom wrapper code to integrate with existing LLM serving infrastructure"],"input_types":["text (user prompts)","structured metadata (user ID, session context, request metadata)"],"output_types":["scanner result object with threat classification, confidence, and downstream scanner recommendations"],"categories":["safety-moderation","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"prompt-guard__cap_3","uri":"capability://safety.moderation.evaluation.against.cyberseceval.v2.benchmark.datasets.for.attack.coverage","name":"evaluation against cyberseceval v2+ benchmark datasets for attack coverage","description":"Prompt Guard's detection capabilities are grounded in and evaluated against the CyberSecEval benchmark suite, which includes MITRE-mapped prompt injection tests, visual prompt injection attacks, and adversarial patterns from multiple attack categories. The model's performance is measured against these standardized benchmarks, providing transparency into which attack types it can detect and which remain out-of-scope, enabling users to understand coverage gaps and make informed deployment decisions.","intents":["I need to understand what types of prompt injection attacks Prompt Guard can actually detect","I want to evaluate Prompt Guard's performance against industry-standard security benchmarks","I need to know which attack patterns are covered and which require additional defenses"],"best_for":["Security teams evaluating LLM safeguards for procurement or deployment decisions","Researchers benchmarking prompt injection detection across multiple models","Organizations with compliance requirements to document security evaluation methodology"],"limitations":["Benchmark coverage is not exhaustive; novel attack patterns not in CyberSecEval may evade detection","Benchmark performance metrics (precision/recall) may not translate directly to production performance with different input distributions","Visual prompt injection detection requires integration with image processing; text-only Prompt Guard does not cover visual attacks","Benchmarks are static; adversarial attack landscape evolves faster than benchmark updates"],"requires":["Access to CyberSecEval benchmark datasets (publicly available in repository)","Understanding of benchmark methodology and threat model assumptions","Evaluation infrastructure to run Prompt Guard against benchmark test cases","Interpretation of precision/recall/F1 metrics in context of your threat model"],"input_types":["benchmark test cases (adversarial prompts from CyberSecEval)"],"output_types":["evaluation metrics (precision, recall, F1, confusion matrix)","per-attack-category performance breakdown"],"categories":["safety-moderation","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"prompt-guard__cap_4","uri":"capability://safety.moderation.lightweight.inference.for.low.latency.preprocessing.in.request.pipelines","name":"lightweight inference for low-latency preprocessing in request pipelines","description":"Prompt Guard is optimized as a lightweight model (~1B parameters) designed for real-time inference in request preprocessing pipelines, with minimal latency overhead added to LLM API calls. The model uses efficient transformer architecture patterns (likely distilled or pruned variants) to enable sub-100ms inference on standard hardware, allowing deployment as a synchronous preprocessing step without requiring asynchronous queuing or significant infrastructure investment.","intents":["I need to add security filtering to my LLM API without adding significant latency to user requests","I want to deploy Prompt Guard on modest hardware (CPU or single GPU) without infrastructure scaling","I need to understand the performance cost of adding prompt injection detection to my pipeline"],"best_for":["High-throughput LLM services where latency is critical (sub-500ms total response time)","Teams with limited GPU infrastructure or budget constraints","Edge deployment scenarios where model size and inference speed are constrained"],"limitations":["Lightweight architecture may trade off detection accuracy for speed; no published ablation studies on this tradeoff","Inference latency varies by hardware; CPU inference may be 5-10x slower than GPU","Batch inference is more efficient than single-request inference; high-concurrency scenarios may require request batching logic","No built-in caching of repeated prompts; identical inputs are re-evaluated each time"],"requires":["GPU with 2-4GB VRAM for optimal inference speed, or CPU with sufficient cores for acceptable latency","Inference framework (PyTorch, ONNX, or TensorRT for optimization)","Integration into request pipeline with latency monitoring","Optional: batching infrastructure for high-concurrency scenarios"],"input_types":["text (user prompts, typically 100-2000 tokens)"],"output_types":["boolean classification (typically within 50-200ms)"],"categories":["safety-moderation","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"prompt-guard__cap_5","uri":"capability://safety.moderation.configurable.detection.thresholds.for.precision.recall.tradeoff.tuning","name":"configurable detection thresholds for precision-recall tradeoff tuning","description":"Prompt Guard outputs logits or confidence scores (in addition to binary classification) that can be thresholded to adjust the precision-recall tradeoff based on application requirements. Users can configure detection sensitivity to prioritize either false-positive reduction (higher threshold, fewer blocks) or false-negative reduction (lower threshold, more blocks), enabling tuning for specific threat models and user experience requirements without retraining.","intents":["I want to tune Prompt Guard's sensitivity to match my risk tolerance and false-positive budget","I need to reduce false positives that are blocking legitimate user requests","I want to increase detection sensitivity for high-security applications where false negatives are costly"],"best_for":["Teams with specific precision/recall requirements based on their threat model","Applications where false positives have high user experience cost (e.g., customer support chatbots)","Security-critical applications where false negatives are unacceptable"],"limitations":["Threshold tuning requires labeled validation data to measure precision/recall; no automated tuning algorithm provided","Optimal threshold varies by input distribution; production threshold may differ from benchmark evaluation","No per-user or per-context threshold configuration; single global threshold applies to all requests","Threshold changes require redeployment or runtime configuration changes; no A/B testing framework provided"],"requires":["Access to model confidence scores (not just binary output)","Validation dataset with labeled examples to measure precision/recall at different thresholds","Monitoring infrastructure to track false-positive and false-negative rates in production","Decision logic for what to do with borderline cases (e.g., log for review, require user confirmation)"],"input_types":["model logits or confidence scores"],"output_types":["configurable threshold value (typically 0.0-1.0)","per-request decision: block, allow, or escalate"],"categories":["safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"prompt-guard__cap_6","uri":"capability://safety.moderation.model.card.documentation.with.threat.model.and.evaluation.methodology","name":"model card documentation with threat model and evaluation methodology","description":"Prompt Guard includes comprehensive model card documentation (MODEL_CARD.md in repository) that specifies the threat model, training data sources, evaluation methodology, performance metrics, and known limitations. This documentation enables users to understand the model's design assumptions, evaluate its suitability for their use case, and make informed decisions about deployment and complementary safeguards.","intents":["I need to understand what threat model Prompt Guard was designed for","I want to evaluate whether Prompt Guard is appropriate for my security requirements","I need documentation for compliance or security audit purposes"],"best_for":["Security teams evaluating safeguards for procurement or deployment","Compliance officers documenting security measures for audits","Researchers understanding model design and limitations"],"limitations":["Model card is static documentation; does not update as new attacks emerge","Documentation may not cover all edge cases or failure modes discovered in production","Model card describes intended use; actual deployment may differ and introduce new risks","No automated validation that deployed model matches documented specifications"],"requires":["Access to repository and model card file","Understanding of threat modeling and security evaluation methodology","Ability to interpret performance metrics in context of your threat model"],"input_types":["model card documentation (markdown)"],"output_types":["threat model specification, training data description, evaluation methodology, performance metrics, limitations"],"categories":["safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"prompt-guard__cap_7","uri":"capability://safety.moderation.open.source.model.weights.and.inference.code.for.self.hosted.deployment","name":"open-source model weights and inference code for self-hosted deployment","description":"Prompt Guard is released as open-source with publicly available model weights and inference code, enabling users to download, inspect, and deploy the model in their own infrastructure without reliance on external APIs or vendor lock-in. The model can be deployed on-premises, in private cloud environments, or at the edge, with full control over data flow and inference infrastructure.","intents":["I need to deploy Prompt Guard in my own infrastructure without sending data to external APIs","I want to inspect the model weights and inference code for security or compliance reasons","I need to deploy Prompt Guard in an air-gapped or offline environment"],"best_for":["Enterprise organizations with data residency or privacy requirements","Teams with existing ML infrastructure and expertise to manage model deployment","Security-conscious organizations that require code inspection and audit trails"],"limitations":["Requires ML infrastructure and operational expertise; no managed service option","Users are responsible for model updates, security patches, and monitoring","No official support or SLA; community support only","Model weights are large (~1-2GB); requires storage and bandwidth for distribution"],"requires":["ML infrastructure (GPU or CPU) for inference","Python environment with PyTorch or compatible inference framework","Operational expertise for model deployment, monitoring, and updates","Storage for model weights and inference logs"],"input_types":["model weights (PyTorch format)","inference code (Python)"],"output_types":["deployed model in user's infrastructure"],"categories":["safety-moderation","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"prompt-guard__cap_8","uri":"capability://safety.moderation.integration.with.llm.provider.abstraction.layer.for.multi.provider.evaluation","name":"integration with llm provider abstraction layer for multi-provider evaluation","description":"Prompt Guard is evaluated and can be integrated with the Purple Llama LLM abstraction layer, which provides unified interfaces to multiple LLM providers (OpenAI, Anthropic, Google, Together, Ollama). This enables consistent evaluation of prompt injection detection across different LLM backends and facilitates deployment in heterogeneous environments where multiple LLM providers are used.","intents":["I use multiple LLM providers and need consistent prompt injection detection across all of them","I want to evaluate Prompt Guard's effectiveness with different LLM backends","I need to integrate Prompt Guard into a multi-provider LLM orchestration layer"],"best_for":["Organizations using multiple LLM providers (OpenAI, Anthropic, Google, etc.)","Teams building LLM orchestration platforms with provider abstraction","Researchers evaluating prompt injection detection across different models"],"limitations":["Abstraction layer adds complexity; requires understanding of provider-specific APIs and error handling","Prompt Guard detection is provider-agnostic; different providers may have different vulnerabilities to injection attacks","No built-in provider-specific tuning; single Prompt Guard model used for all providers","Evaluation results may vary by provider; no guarantee that detection effectiveness is consistent"],"requires":["Understanding of LLM provider APIs (OpenAI, Anthropic, Google, Together, Ollama)","API keys or credentials for providers being used","Integration with Purple Llama's LLM abstraction layer","Evaluation infrastructure to test across multiple providers"],"input_types":["text prompts","provider configuration (API keys, model names, parameters)"],"output_types":["injection detection results with provider metadata"],"categories":["safety-moderation","tool-use-integration"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"prompt-guard__headline","uri":"capability://safety.moderation.prompt.injection.detection.model","name":"prompt injection detection model","description":"A lightweight classifier model designed to detect prompt injection and jailbreak attempts in LLM inputs, enhancing the security of generative AI applications.","intents":["best prompt injection detection model","prompt injection detection for LLMs","how to secure LLMs against prompt injection","lightweight classifiers for AI security","best practices for prompt injection prevention"],"best_for":["developers building LLM applications","security teams evaluating AI systems"],"limitations":["may not cover all types of vulnerabilities","requires integration into existing LLM workflows"],"requires":["LLM application for deployment"],"input_types":["text inputs for LLMs"],"output_types":["classification results indicating prompt safety"],"categories":["safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":56,"verified":false,"data_access_risk":"high","permissions":["PyTorch or compatible inference runtime","Model weights downloaded from Meta's model hub (approximately 1-2GB)","Python 3.8+ for inference wrapper","Sufficient GPU memory or CPU for real-time inference (model size ~1B parameters)","Same model weights as English version (no separate language models)","Optional: language detection preprocessing to log or route by language","Understanding that multilingual training is via translation, not native data","LlamaFirewall framework installed and configured","Understanding of scanner component interfaces and data flow","Configuration of scanner ordering and decision logic"],"failure_modes":["Binary classification only — returns true/false without confidence scores or attack type categorization","Trained primarily on English prompt injection patterns; multilingual coverage may be limited","Cannot detect novel zero-day injection techniques not represented in training data","Requires integration into request pipeline; no built-in rate limiting or logging","Multilingual coverage depends on machine translation quality; semantic drift in translation may reduce detection accuracy","No explicit language identification — model processes all inputs with same weights regardless of language","Training data is machine-translated, not native-speaker crafted, potentially missing language-specific attack patterns","Performance may degrade for low-resource languages or code-mixed inputs","Requires understanding of LlamaFirewall architecture and scanner composition","Integration adds orchestration overhead; no built-in performance optimization across scanners","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.7,"quality":0.8500000000000001,"ecosystem":0.39999999999999997,"match_graph":0.25,"freshness":0.52,"weights":{"adoption":0.35,"quality":0.2,"ecosystem":0.1,"match_graph":0.3,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-06-17T09:51:05.295Z","last_scraped_at":null,"last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=prompt-guard","compare_url":"https://unfragile.ai/compare?artifact=prompt-guard"}},"signature":"XjpBfAOv4o2DdNrweuy4gHfn9KQSJ+pb9/62anEHf4BHqVHnxYFuiZAZr5uGLIB9QnZmhZ10aUe11daIctnrBQ==","signedAt":"2026-06-20T08:35:08.411Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/prompt-guard","artifact":"https://unfragile.ai/prompt-guard","verify":"https://unfragile.ai/api/v1/verify?slug=prompt-guard","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}