Curated Adversarial Prompt Dataset With Human Annotations

1

LMSYS Chatbot Arena · Benchmark · 62/100

via “crowdsourced prompt collection and curation”

Crowdsourced LLM evaluation: side-by-side blind voting and Elo ratings; widely regarded as one of the most trusted LLM benchmarks.

Unique: Leverages the community to continuously expand the benchmark dataset rather than relying on a fixed set of expert-curated prompts. Prompts are selected for evaluation based on community interest, creating a living benchmark that evolves with user priorities.

vs others: More scalable and diverse than expert-curated benchmarks because it taps community creativity; more representative of real-world usage than synthetic prompt sets
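
As a rough illustration of how side-by-side blind votes can turn into Elo ratings, here is a minimal sketch. The K-factor of 32, the starting rating of 1000, and the model names are assumptions chosen for illustration; they are not taken from Chatbot Arena's actual pipeline.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one vote outcome in place (k is an assumed K-factor)."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - exp_win)  # larger gain for an unexpected win
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical models and votes, each vote recorded as (winner, loser).
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)
```

Because each update adds to the winner exactly what it subtracts from the loser, the total rating mass stays constant, which makes ratings comparable as new prompts and votes arrive.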

2

BIG-Bench Hard (BBH) · Dataset · 59/100

via “few-shot prompt engineering and optimization”

23 challenging BIG-Bench tasks on which earlier language models failed to outperform the average human rater.

Unique: Provides structured few-shot exemplars that are explicitly designed for prompt engineering experimentation, enabling researchers to test prompt sensitivity and optimization strategies without task re-annotation. The dataset structure supports exemplar variation and prompt template modification.

vs others: More suitable for prompt engineering research than generic task collections because it includes curated exemplars; more flexible than fixed-prompt benchmarks because exemplars can be modified and optimized.
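
The exemplar variation described above can be sketched as a small prompt builder. The `Q:`/`A:` template and the exemplars below are hypothetical placeholders, not the dataset's actual prompt files.

```python
def build_few_shot_prompt(exemplars, question, template="Q: {q}\nA: {a}"):
    """Join worked exemplars and leave the final answer slot open."""
    shots = [template.format(q=q, a=a) for q, a in exemplars]
    query = template.format(q=question, a="").rstrip()
    return "\n\n".join(shots + [query])


# Hypothetical exemplars; swapping or reordering them tests prompt sensitivity.
exemplars = [
    ("Is 7 a prime number?", "Yes"),
    ("Is 8 a prime number?", "No"),
]
prompt = build_few_shot_prompt(exemplars, "Is 11 a prime number?")
```

Passing a different `template` or a different exemplar list changes the prompt without touching the task data, which is the kind of experiment this entry describes.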

3

OpenAssistant Conversations (OASST) · Dataset · 57/100

via “large-scale human-written dataset with volunteer annotation pipeline”

161K human-written messages in 35 languages with quality ratings.

Unique: Among the largest instruction datasets written entirely by humans rather than generated by an LLM, created by 13,000+ volunteers instead of single-model or synthetic generation. Preserves natural human diversity in writing and preferences.

vs others: More authentic and diverse than LLM-generated datasets (e.g., Alpaca, or ShareGPT's ChatGPT conversations) or synthetic preference pairs. Larger human-written component than most alternatives, though with quality variance requiring filtering.

4

WinoGrande · Dataset · 57/100

via “bias-resistant example curation through adversarial filtering”

44K pronoun resolution problems testing commonsense understanding.

Unique: Applies adversarial filtering specifically targeting statistical shortcuts (word frequency, syntactic position, gender stereotypes) through automated correlation analysis + human validation, rather than passive bias documentation; filtering is integrated into dataset construction rather than post-hoc

vs others: More proactive than datasets with bias documentation (e.g., BOLD) because biases are removed rather than flagged; more systematic than manual curation because automated detection identifies subtle correlations humans might miss
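
To make the idea of filtering statistical shortcuts concrete, here is a deliberately simplified sketch that drops examples containing a token that almost always co-occurs with one label. It is not the adversarial filtering algorithm WinoGrande itself uses; the thresholds, function name, and toy sentences are all assumptions for illustration.

```python
from collections import Counter, defaultdict


def filter_shortcuts(examples, min_count=3, max_purity=0.9):
    """Drop examples containing tokens that nearly determine the label."""
    token_labels = defaultdict(Counter)
    for text, label in examples:
        for tok in set(text.lower().split()):
            token_labels[tok][label] += 1
    # A token is a "giveaway" if it is frequent and tied to a single label.
    giveaway = {
        tok
        for tok, counts in token_labels.items()
        if sum(counts.values()) >= min_count
        and max(counts.values()) / sum(counts.values()) >= max_purity
    }
    kept = [(t, l) for t, l in examples if not giveaway & set(t.lower().split())]
    return kept, giveaway


# Toy data: "trophy" always appears with label 1, so it leaks the answer.
examples = [
    ("the trophy is big", 1),
    ("the trophy is large", 1),
    ("the trophy is huge", 1),
    ("the suitcase is small", 0),
    ("the suitcase is tiny", 0),
    ("the box is wide", 1),
    ("the box is narrow", 0),
]
kept, giveaway = filter_shortcuts(examples)
```

A real pipeline would follow automated detection like this with human validation, as the entry notes, since a purely statistical filter can also discard legitimate examples.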

5

UltraChat 200K · Dataset · 57/100

via “multi-turn dialogue dataset curation and filtering”

200K high-quality multi-turn dialogues for instruction tuning.

Unique: Uses dual-agent ChatGPT generation (user and assistant roles) with category-stratified sampling across three semantic domains, then applies quality filtering to create a balanced 200K subset — this synthetic-then-filtered approach differs from crowdsourced datasets (which have annotation overhead) and raw model outputs (which lack quality curation)

vs others: Larger and more category-balanced than collections of user-shared conversations (e.g., ShareGPT), yet more curated than raw model-generated conversation dumps, making it well suited to training models that generalize across multiple dialogue types.
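
The filter-then-stratify step can be sketched as follows. The category names, the minimum-turn rule used as a stand-in quality filter, and the sample sizes are assumptions for illustration, not the dataset's published procedure.

```python
import random
from collections import defaultdict


def stratified_sample(dialogues, per_category, min_turns=2, seed=0):
    """Filter dialogues, then draw the same number from each category."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for d in dialogues:
        if len(d["turns"]) >= min_turns:  # stand-in quality filter
            by_category[d["category"]].append(d)
    sample = []
    for category in sorted(by_category):
        pool = by_category[category]
        sample.extend(rng.sample(pool, min(per_category, len(pool))))
    return sample


# Hypothetical corpus: three domains, dialogues of 1 to 4 turns each.
categories = ["world", "writing", "materials"]
dialogues = [
    {"category": c, "turns": ["msg"] * (1 + i % 4)}
    for c in categories
    for i in range(10)
]
subset = stratified_sample(dialogues, per_category=5)
```

Sampling a fixed count per category keeps the subset balanced even when the raw generation is skewed toward one domain.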

6

WildGuard · Dataset · 56/100

Allen AI's safety classification dataset and model.

Unique: Combines three annotation dimensions (prompt harmfulness, response harmfulness, refusal appropriateness) in a single dataset, enabling multi-task learning and comprehensive safety evaluation — most public datasets focus on only one dimension

vs others: More comprehensive than generic toxicity datasets (e.g., Jigsaw) because it's specifically curated for adversarial prompts and LLM jailbreaks; more detailed than simple safe/unsafe labels because it provides fine-grained harm categories and multi-dimensional annotations
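
One way to picture the three annotation dimensions is as a single record that feeds several classification tasks. The class, field names, and example rows below are hypothetical and do not reflect the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SafetyExample:
    """One annotated item; field names are hypothetical, not the real schema."""
    prompt: str
    response: Optional[str]            # None for prompt-only items
    prompt_harmful: bool
    response_harmful: Optional[bool]   # None when there is no response
    refusal: Optional[bool]            # None when there is no response


def multitask_targets(ex: SafetyExample) -> dict:
    """Return one target per task, skipping tasks that lack a label."""
    targets = {"prompt_harm": int(ex.prompt_harmful)}
    if ex.response_harmful is not None:
        targets["response_harm"] = int(ex.response_harmful)
    if ex.refusal is not None:
        targets["refusal"] = int(ex.refusal)
    return targets


examples = [
    SafetyExample("How do I pick a lock?", "I can't help with that.", True, False, True),
    SafetyExample("What is the capital of France?", None, False, None, None),
]
targets = [multitask_targets(ex) for ex in examples]
```

Keeping the labels on one record lets a single model be trained or evaluated on all three tasks while still handling prompt-only items.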

7

Awesome-Prompt-Engineering · Prompt · 36/100

via “prompt-engineering-dataset-and-benchmark-reference”

This repository contains hand-curated resources for Prompt Engineering, with a focus on Generative Pre-trained Transformer (GPT) models, ChatGPT, PaLM, etc.

Unique: Focuses specifically on prompt engineering datasets and benchmarks rather than general NLP datasets, documenting evaluation metrics and use cases specific to prompt optimization

vs others: More specialized than general dataset repositories because it curates for prompt engineering relevance; more accessible than academic papers because it provides direct links and practical descriptions

8

VBench · Benchmark · 36/100

via “standardized prompt suite generation and curation for video model comparison”

[CVPR2024 Highlight] VBench - We Evaluate Video Generation

Unique: Curates prompts with explicit semantic stratification (objects, actions, scenes, attributes) and validates against human preference annotations to ensure prompts discriminate between model quality levels. Maintains separate prompt suites for T2V, I2V, and long-video evaluation with dimension-aware metadata mapping.

vs others: More rigorous than ad-hoc prompt selection because prompts are validated against human preferences and stratified by semantic category; more reproducible than user-defined prompts because the suite is fixed and publicly available.

9

chatgpt_system_prompt · Prompt · 33/100

via “multi-source-prompt-aggregation-and-curation”

A collection of GPT system prompts and various prompt injection/leaking knowledge.

Unique: Maintains three parallel prompt collections (official-product with 141+ entries, gpts with 1,100+ entries, opensource-prj with 20+ entries) in separate directory hierarchies, each with its own TOC, enabling both source-specific browsing and cross-source comparison. The architecture preserves source identity while enabling unified discovery through the root-level TOC.md.

vs others: More comprehensive than vendor-specific prompt collections (e.g., OpenAI's official docs alone) because it includes community contributions and competing vendors, but less curated than specialized prompt marketplaces that apply quality filters or user ratings.

10

PROMPTS.md · Dataset · 23/100

via “markdown-based prompt template library with contributor attribution”

[Hugging Face Dataset](https://huggingface.co/datasets/fka/prompts.chat)

Unique: Combines GitHub raw file hosting with Hugging Face dataset mirroring, enabling both direct markdown parsing and programmatic access through the datasets library without requiring a custom API layer. Uses simple markdown structure with contributor attribution via GitHub usernames, making contributions transparent and discoverable.

vs others: Simpler and more transparent than proprietary prompt marketplaces because it's version-controlled on GitHub with visible contributor history, and more accessible than academic prompt datasets because it requires no authentication or complex tooling.

11

imgsys · Benchmark · 21/100

via “prompt standardization and benchmark dataset curation”

A generative image model arena by fal.ai.

Unique: Curates a community-validated prompt set that balances breadth (covering diverse image generation tasks) with depth (multiple prompts per category to reduce noise). Prompts are tagged with difficulty and capability dimensions, enabling stratified analysis rather than single aggregate scores.

vs others: More representative of diverse use cases than academic benchmarks (which focus on narrow metrics), and more stable than user-submitted prompts (which vary in quality and intent). However, less comprehensive than proprietary model evaluation suites that test thousands of edge cases.

12

Ordinary People Prompts · Prompt

via “prompt-quality-curation-without-versioning”

Unique: Relies on human editorial curation as a quality signal rather than community voting, algorithmic ranking, or performance metrics, but lacks the versioning infrastructure needed to maintain accuracy as models evolve

vs others: Provides editorial trust that community-driven repositories lack, but offers no version tracking or model-specific guidance that more mature prompt management platforms (e.g., LangSmith, Prompt Flow) provide

13

Chatbot Arena · Benchmark

via “community-driven prompt curation and task distribution”

14

Prompt Journey · Prompt

via “industry-vertical prompt curation”

Unique: Uses pure editorial curation without algorithmic ranking, community voting, or performance metrics — a human-first approach that trades data-driven optimization for simplicity and accessibility

vs others: More trustworthy for beginners than algorithmic recommendations, but less effective than community-driven platforms like PromptBase that aggregate user feedback and success metrics
