{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"constitutional-ai","slug":"constitutional-ai","name":"Constitutional AI","type":"prompt","url":"https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback","page_url":"https://unfragile.ai/constitutional-ai","categories":["prompt-engineering"],"tags":[],"pricing":{"model":"free","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"constitutional-ai__cap_0","uri":"capability://safety.moderation.self.critique.and.revision.training.loop","name":"self-critique-and-revision training loop","description":"Constitutional AI implements a two-phase training methodology where models first generate self-critiques of their own outputs against a defined constitution of principles, then generate revised responses based on those critiques. This supervised learning phase uses the model's own reasoning to improve outputs before any reinforcement learning, creating a self-improvement loop that doesn't require human annotation of every problematic output. The architecture chains the model's critique capability with its revision capability in a single training pass.","intents":["Train AI models to self-improve without requiring human labeling of every harmful output","Create models that can explain their reasoning when critiquing their own behavior","Reduce human annotation burden in safety-critical training by leveraging model self-evaluation"],"best_for":["AI safety researchers training large language models","Organizations building internal LLM systems with custom safety requirements","Teams implementing alignment techniques beyond standard RLHF"],"limitations":["Requires a well-defined constitution of principles — poorly specified principles lead to inconsistent self-critique","Self-critique quality depends on the base model's reasoning capability — weaker models may generate superficial critiques","No built-in mechanism to detect when the model's self-critique is itself biased or incorrect","Computational cost of generating critiques and revisions for every training sample adds significant overhead to the training pipeline"],"requires":["Base language model with sufficient reasoning capability (Claude-level or equivalent)","Explicitly defined constitution document with clear principles","Training infrastructure supporting multi-turn generation and finetuning","Evaluation methodology to validate critique quality"],"input_types":["text prompts","model-generated outputs to critique","constitution principles (structured text)"],"output_types":["critique text (model's analysis of its own output)","revised output text (improved response based on critique)"],"categories":["safety-moderation","planning-reasoning"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"constitutional-ai__cap_1","uri":"capability://safety.moderation.constitution.guided.behavior.shaping","name":"constitution-guided behavior shaping","description":"Constitutional AI uses an explicit set of written principles (a 'constitution') to guide model behavior rather than relying solely on implicit patterns learned from human feedback. During training, the model's outputs are evaluated and revised against these explicit principles, creating a transparent governance model where safety and helpfulness rules are codified as text. This approach allows organizations to define their own behavioral principles and have the training process enforce them systematically.","intents":["Define explicit behavioral principles for AI systems that go beyond generic safety guidelines","Ensure model behavior aligns with organization-specific values and policies","Create auditable training processes where safety rules are documented and traceable"],"best_for":["Enterprise teams building AI systems with custom compliance or ethical requirements","Researchers studying how explicit rules affect model behavior vs implicit learning","Organizations needing to explain their AI safety approach to regulators or stakeholders"],"limitations":["Constitution quality directly determines training quality — vague or contradictory principles produce inconsistent results","No automatic mechanism to detect conflicts between principles in the constitution","Model may interpret principles differently than intended, requiring iterative refinement","Constitutional principles are static; they don't adapt to new harmful use cases discovered post-training"],"requires":["Carefully crafted constitution document with clear, non-contradictory principles","Domain expertise to write principles that capture intended behavior","Evaluation framework to test whether trained model actually follows the constitution","Mechanism to update constitution and retrain if principles prove insufficient"],"input_types":["constitution text (explicit principles)","model outputs to evaluate against constitution","feedback on whether outputs align with principles"],"output_types":["trained model weights","behavior aligned with constitutional principles","audit trail of which principles were applied during training"],"categories":["safety-moderation","planning-reasoning"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"constitutional-ai__cap_2","uri":"capability://safety.moderation.reinforcement.learning.from.ai.feedback.rlaif","name":"reinforcement learning from ai feedback (rlaif)","description":"Constitutional AI implements a reinforcement learning phase where the trained model itself generates preference judgments between pairs of outputs, replacing human annotators in the preference labeling step. The model learns to evaluate which of two responses better follows the constitution, then a preference model is trained on these AI-generated judgments, and finally the original model is trained with RL using this preference model as a reward signal. This creates a scalable alternative to RLHF that reduces human annotation bottlenecks.","intents":["Scale preference-based training without requiring large teams of human annotators","Use the model's own reasoning to evaluate output quality rather than relying on human judgment","Create a feedback loop where the model's preference judgments improve as the model improves"],"best_for":["Large-scale model training where human annotation is a bottleneck","Teams implementing alignment techniques that want to reduce human feedback dependency","Researchers studying whether AI-generated preferences can match or exceed human preferences"],"limitations":["AI-generated preferences may encode the model's own biases rather than objective quality metrics","No guarantee that AI preferences align with human values — requires validation against human judgment","Preference model trained on AI feedback may drift from human preferences over multiple training iterations","Computational cost of generating preference judgments for all output pairs is substantial","Requires careful tuning to prevent reward hacking where the model learns to game the preference model"],"requires":["Base model capable of generating coherent preference judgments","Preference model training infrastructure (typically a classifier or ranking model)","RL training pipeline (PPO or similar algorithm)","Validation methodology to compare AI preferences against human preferences","Monitoring to detect preference drift over training iterations"],"input_types":["pairs of model outputs to compare","constitution principles for evaluating preferences","initial preference judgments (human or AI-generated)"],"output_types":["preference judgments (which output is better and why)","preference model weights","RL-trained model weights optimized for preferences"],"categories":["safety-moderation","planning-reasoning"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"constitutional-ai__cap_3","uri":"capability://safety.moderation.non.evasive.harmful.query.engagement","name":"non-evasive harmful-query engagement","description":"Constitutional AI trains models to engage substantively with harmful or sensitive queries by explaining their objections rather than refusing outright. When a user asks about a harmful topic, the model is trained to articulate why it has concerns about the request while still providing relevant context or explanation. This is implemented through constitutional principles that encourage transparency and engagement rather than evasion, and through training examples where the model demonstrates this balanced approach.","intents":["Build AI assistants that explain safety boundaries rather than simply refusing requests","Enable users to understand why an AI system won't help with something rather than feeling stonewalled","Create more helpful assistants that can discuss sensitive topics while maintaining safety boundaries"],"best_for":["Customer-facing AI assistants where transparency builds trust","Educational or research contexts where explaining limitations is valuable","Systems where users need to understand safety decisions to work around them appropriately"],"limitations":["Requires careful constitutional principles to prevent the model from providing harmful information while explaining its objections","More computationally expensive than simple refusal — requires generating explanatory text","Risk that detailed explanations of why something is harmful could inadvertently provide a roadmap for harmful behavior","Requires extensive testing to ensure the model doesn't accidentally help with harmful requests while explaining its concerns","May increase user frustration if explanations are perceived as preachy or condescending"],"requires":["Constitutional principles that balance transparency with safety","Training examples demonstrating appropriate engagement with sensitive topics","Evaluation methodology to test that explanations don't inadvertently enable harm","Human review of edge cases where engagement might be inappropriate"],"input_types":["user queries on sensitive or potentially harmful topics","constitutional principles defining appropriate engagement","training examples of good and bad engagement patterns"],"output_types":["explanatory text describing the model's concerns","contextual information relevant to the query","clear statement of what the model won't help with and why"],"categories":["safety-moderation","text-generation-language"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"constitutional-ai__cap_4","uri":"capability://planning.reasoning.chain.of.thought.reasoning.for.transparency","name":"chain-of-thought reasoning for transparency","description":"Constitutional AI incorporates chain-of-thought reasoning into the training process, where models are trained to show their reasoning steps when critiquing outputs and making decisions. This makes the model's decision-making process interpretable and auditable — users and developers can see not just what the model decided but why it made that decision. The reasoning chain becomes part of the training signal, helping the model learn to make decisions that are not just correct but also explainable.","intents":["Make AI safety decisions auditable by showing the reasoning behind them","Help users understand how the model evaluated a request or generated a response","Enable developers to debug model behavior by examining the reasoning chain"],"best_for":["High-stakes applications where decision transparency is required (healthcare, legal, financial)","Regulatory contexts where explainability is mandated","Teams building AI systems where understanding model behavior is critical for safety"],"limitations":["Chain-of-thought reasoning adds latency to inference — models must generate reasoning before generating outputs","Reasoning chains can be misleading if the model's reasoning is flawed but sounds plausible","Longer reasoning chains increase token usage and computational cost","Users may over-trust reasoning that sounds coherent but is actually incorrect","Reasoning transparency doesn't guarantee the underlying decision is correct or fair"],"requires":["Model capable of generating coherent step-by-step reasoning","Training data with examples of good reasoning chains","Evaluation methodology to validate that reasoning chains are actually sound","User interface that can present reasoning chains clearly"],"input_types":["user queries or model outputs to explain","constitutional principles to reason about","training examples with reasoning chains"],"output_types":["step-by-step reasoning text","intermediate conclusions","final decision with justification"],"categories":["planning-reasoning","safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"constitutional-ai__cap_5","uri":"capability://safety.moderation.human.evaluated.safety.benchmarking","name":"human-evaluated safety benchmarking","description":"Constitutional AI includes a human evaluation framework where trained models are assessed by human judges on dimensions like harmlessness, helpfulness, and honesty. The evaluation process measures how well the model follows the constitution and whether it achieves the intended safety properties. This creates a feedback loop where human evaluation results inform whether the constitutional principles are working as intended and whether additional training iterations are needed.","intents":["Validate that constitutional training actually produces safer models","Measure whether AI-generated preferences align with human judgment","Identify gaps between intended behavior (constitution) and actual behavior (human evaluation)"],"best_for":["Research teams validating new alignment techniques","Organizations building safety-critical AI systems that need human validation","Teams implementing Constitutional AI who need to measure training effectiveness"],"limitations":["Human evaluation is expensive and time-consuming — limits the scale of evaluation","Human judges may have inconsistent standards or biases that affect evaluation","Evaluation results are specific to the judges and evaluation criteria used — may not generalize","Requires careful design of evaluation rubrics to ensure consistent judgment","Human evaluation is typically done on a sample of outputs, not comprehensive coverage"],"requires":["Panel of human evaluators with relevant expertise","Clear evaluation rubrics defining harmlessness, helpfulness, and honesty","Evaluation dataset with diverse scenarios and edge cases","Statistical analysis methodology to aggregate judge opinions","Inter-rater reliability measurement to validate consistency"],"input_types":["model outputs to evaluate","evaluation rubrics and criteria","context about the original user query"],"output_types":["harmlessness scores","helpfulness scores","honesty/transparency scores","qualitative feedback from judges","aggregate statistics on model performance"],"categories":["safety-moderation","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"constitutional-ai__cap_6","uri":"capability://safety.moderation.multi.principle.constitution.composition","name":"multi-principle constitution composition","description":"Constitutional AI supports defining multiple, potentially overlapping principles in a single constitution document, allowing organizations to encode complex behavioral rules that balance competing values. The training process must navigate cases where principles conflict or apply differently to different scenarios. The model learns to reason about which principles apply in which contexts and how to balance them when they conflict.","intents":["Define nuanced behavioral rules that balance helpfulness with safety","Encode organization-specific values that may differ from generic safety guidelines","Create constitutions that handle edge cases where simple rules don't apply"],"best_for":["Organizations with complex safety requirements that can't be captured in a single rule","Teams building AI systems for specific domains with domain-specific principles","Researchers studying how models handle conflicting objectives"],"limitations":["No automatic mechanism to detect conflicts between principles — requires manual review","Model may apply principles inconsistently when they conflict","Difficult to debug which principle caused a particular model behavior when multiple principles apply","Increasing the number of principles increases training complexity and computational cost","Principles may interact in unexpected ways, requiring extensive testing"],"requires":["Careful design of principles to minimize conflicts","Clear priority ordering or conflict resolution rules","Testing methodology to validate behavior when multiple principles apply","Documentation of principle interactions and edge cases"],"input_types":["multiple constitutional principles (text)","scenarios where principles might conflict","training examples demonstrating principle application"],"output_types":["trained model that applies multiple principles","reasoning showing which principles were applied","behavior that balances competing principles"],"categories":["safety-moderation","planning-reasoning"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"constitutional-ai__cap_7","uri":"capability://safety.moderation.iterative.constitution.refinement","name":"iterative constitution refinement","description":"Constitutional AI supports an iterative development process where initial constitutions are tested, evaluated against human judgment, and refined based on results. When human evaluation reveals that the model's behavior doesn't match the intended constitution, the constitution can be updated with clarifications, additional principles, or principle revisions, and the model can be retrained. This creates a feedback loop between evaluation results and constitution design.","intents":["Improve constitutional principles based on real model behavior and human feedback","Discover gaps in the constitution that weren't apparent during initial design","Refine principles that are too vague or produce inconsistent results"],"best_for":["Teams building safety-critical systems who need to iterate on safety rules","Organizations implementing Constitutional AI for the first time and learning what works","Researchers studying how to design effective constitutions"],"limitations":["Each iteration requires retraining the model, which is computationally expensive","Difficult to isolate which constitution changes caused behavior changes","Risk of overfitting the constitution to specific evaluation examples","No systematic methodology for determining when a constitution is 'good enough'","Iterative refinement can be slow if evaluation cycles are long"],"requires":["Evaluation methodology to identify constitution gaps","Version control for constitution documents","Retraining infrastructure that can handle multiple iterations","Clear criteria for deciding when iteration is complete","Documentation of constitution evolution and rationale for changes"],"input_types":["initial constitution","human evaluation results","model behavior analysis","feedback on constitution clarity"],"output_types":["refined constitution document","retrained model","evaluation results showing improvement","change log documenting constitution evolution"],"categories":["safety-moderation","planning-reasoning"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"constitutional-ai__cap_8","uri":"capability://safety.moderation.constitutional.principle.extraction.from.examples","name":"constitutional principle extraction from examples","description":"Constitutional AI can derive or validate constitutional principles by analyzing examples of desired and undesired model behavior. Rather than writing principles from scratch, organizations can provide examples of outputs they want the model to produce and outputs they want to avoid, and use these examples to inform or validate the constitution. This approach grounds principles in concrete behavior rather than abstract values.","intents":["Develop constitutions based on concrete examples of desired behavior","Validate that written principles actually capture the intended behavior","Discover implicit principles that weren't explicitly articulated"],"best_for":["Teams that have examples of good and bad model behavior but haven't formalized principles","Organizations validating that their constitution matches their actual values","Researchers studying what principles are implicit in human preferences"],"limitations":["Extracting principles from examples is subjective — different people may extract different principles","Examples may not cover all edge cases, leading to incomplete principles","Principles extracted from examples may be overly specific and not generalize","Requires significant manual effort to analyze examples and extract principles","No guarantee that extracted principles will work well for new scenarios not covered by examples"],"requires":["Large set of examples with clear labels (good/bad behavior)","Methodology for analyzing examples to extract principles","Domain expertise to interpret what principles the examples represent","Validation that extracted principles work for new scenarios"],"input_types":["examples of desired model outputs","examples of undesired model outputs","context about why each example is good or bad"],"output_types":["extracted constitutional principles","validation that principles match examples","confidence scores for each principle"],"categories":["safety-moderation","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":49,"verified":false,"data_access_risk":"high","permissions":["Base language model with sufficient reasoning capability (Claude-level or equivalent)","Explicitly defined constitution document with clear principles","Training infrastructure supporting multi-turn generation and finetuning","Evaluation methodology to validate critique quality","Carefully crafted constitution document with clear, non-contradictory principles","Domain expertise to write principles that capture intended behavior","Evaluation framework to test whether trained model actually follows the constitution","Mechanism to update constitution and retrain if principles prove insufficient","Base model capable of generating coherent preference judgments","Preference model training infrastructure (typically a classifier or ranking model)"],"failure_modes":["Requires a well-defined constitution of principles — poorly specified principles lead to inconsistent self-critique","Self-critique quality depends on the base model's reasoning capability — weaker models may generate superficial critiques","No built-in mechanism to detect when the model's self-critique is itself biased or incorrect","Computational cost of generating critiques and revisions for every training sample adds significant overhead to the training pipeline","Constitution quality directly determines training quality — vague or contradictory principles produce inconsistent results","No automatic mechanism to detect conflicts between principles in the constitution","Model may interpret principles differently than intended, requiring iterative refinement","Constitutional principles are static; they don't adapt to new harmful use cases discovered post-training","AI-generated preferences may encode the model's own biases rather than objective quality metrics","No guarantee that AI preferences align with human values — requires validation against human judgment","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.7,"quality":0.8500000000000001,"ecosystem":0.15000000000000002,"match_graph":0.25,"freshness":0.9,"weights":{"adoption":0.15,"quality":0.25,"ecosystem":0.1,"match_graph":0.45,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-05-24T12:16:21.548Z","last_scraped_at":null,"last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=constitutional-ai","compare_url":"https://unfragile.ai/compare?artifact=constitutional-ai"}},"signature":"TiENEj+C1facb1gzIAUfsmQFR6wxR+eJcQzvTuOfWjC9PNNjGIhyMZW/yV/Hn4nxz/GO9ytJVmmq0UZW0Bs7CQ==","signedAt":"2026-06-15T06:53:13.063Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/constitutional-ai","artifact":"https://unfragile.ai/constitutional-ai","verify":"https://unfragile.ai/api/v1/verify?slug=constitutional-ai","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}