Common Crawl
Dataset · Free
Largest open web crawl archive and the foundation of virtually every major LLM training dataset.
Capabilities (9 decomposed)
petabyte-scale monthly web crawl capture with warc archival
Medium confidence: Executes monthly crawl cycles capturing 3-5 billion web pages using the CCBot crawler agent, storing raw HTTP responses, headers, and page content in WARC (Web ARChive) format on AWS S3. Respects robots.txt and maintains an opt-out registry to exclude domains from crawling. Each monthly snapshot becomes a permanent archive layer, accumulating 300+ billion pages across 15+ years of operation.
Operates as a non-profit public infrastructure project with 15+ years of continuous monthly crawls stored in standard WARC format, making it the largest open web archive. Unlike commercial crawlers, Common Crawl publishes entire monthly snapshots as immutable archives rather than incremental updates, enabling reproducible research across time periods.
More freely accessible in bulk than the Wayback Machine (which focuses on preserving and replaying specific URLs), and more standardized than the proprietary web crawl datasets used by search engines or AI companies
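As a quick orientation, the sketch below enumerates the available monthly snapshots. It assumes the public collection-info endpoint (collinfo.json) served by the index server at index.commoncrawl.org; the endpoint and field names follow that API and are not specified in the documentation above.

```python
# A minimal sketch: list monthly crawl snapshots via the public
# collection-info endpoint. Endpoint and field names are assumptions
# based on the index server at index.commoncrawl.org.
import requests

resp = requests.get("https://index.commoncrawl.org/collinfo.json", timeout=30)
resp.raise_for_status()
for crawl in resp.json():
    # Each entry describes one monthly snapshot, e.g. CC-MAIN-2023-50.
    print(crawl.get("id"), crawl.get("name"))
```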
cdxj index-based url lookup and warc file location resolution
Medium confidence: Provides CDXJ (Capture inDeX JSON) indices that map URLs to their locations within WARC files, enabling random access to specific crawled pages without scanning entire archives. The index structure stores URL metadata and WARC file offsets, allowing efficient retrieval of individual pages from petabyte-scale datasets. Users query the index to locate a URL, then fetch only the relevant WARC segment from S3.
Uses CDXJ (JSON-based capture index) format for URL-to-WARC mapping, enabling O(log n) lookup instead of linear WARC scanning. This approach allows researchers to retrieve individual pages from petabyte archives without downloading entire monthly snapshots, making Common Crawl accessible to resource-constrained teams.
More efficient than downloading full WARC files and more standardized than proprietary index formats used by commercial web archives
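To make the lookup flow concrete, here is a hedged sketch of the two-step pattern described above: query the index for a URL, then fetch only the matching byte range of the WARC file. The crawl id is an example, and the hostnames and field names follow the public CDX index API rather than anything specified in the documentation above.

```python
# A hedged sketch of index-based retrieval: look up a URL in the public
# CDX index API, then fetch only the matching WARC segment with an HTTP
# Range request. CC-MAIN-2023-50 is an example crawl id.
import gzip
import json

import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"
resp = requests.get(
    INDEX,
    params={"url": "example.com", "output": "json", "limit": "1"},
    timeout=60,
)
resp.raise_for_status()
record = json.loads(resp.text.splitlines()[0])  # first capture row

# The index row carries the WARC filename, byte offset, and record length.
offset, length = int(record["offset"]), int(record["length"])
warc_url = "https://data.commoncrawl.org/" + record["filename"]
headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
segment = requests.get(warc_url, headers=headers, timeout=60).content

# Each segment is an independently gzipped WARC record.
print(gzip.decompress(segment).decode("utf-8", errors="replace")[:400])
```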
columnar index for metadata-based filtering and analytics
Medium confidence: Provides a columnar index structure (format and technical details unknown from documentation) that enables efficient filtering and aggregation across crawl metadata without accessing raw WARC content. Allows queries on metadata dimensions like domain, content type, HTTP status codes, and capture timestamps. Designed for analytical workloads that need statistics or filtered subsets of the crawl without full content retrieval.
Unknown — insufficient data. Documentation mentions columnar index existence but provides no technical specification, query interface, or usage examples.
Unknown — insufficient data to compare against alternative indexing approaches
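Since the documentation confirms only that a columnar index exists, the following is a sketch under assumptions: if the index is available as (or exported to) Parquet, an embedded engine such as DuckDB could run the metadata filters described above. The file name and column names are illustrative, not confirmed.

```python
# A sketch under assumptions: format of the columnar index is not
# documented above. If Parquet, DuckDB can filter crawl metadata without
# touching raw WARC content. File and column names are illustrative.
import duckdb

rows = duckdb.sql(
    """
    SELECT url, content_mime_type, fetch_status
    FROM read_parquet('cc-index-sample.parquet')  -- hypothetical local export
    WHERE fetch_status = 200
      AND content_mime_type = 'text/html'
    LIMIT 10
    """
).fetchall()
print(rows)
```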
web graph extraction with domain-level link structure and backlink analysis
Medium confidence: Extracts a domain-level link graph from crawl data, capturing which domains link to which other domains and their backlink relationships. Produces graph data (format unknown) representing the web's connectivity structure. Enables analysis of domain authority, link patterns, and web topology without processing raw page content. Referenced as 'BacklinkDB' in the documentation, but technical details are not provided.
Unknown — insufficient data. Documentation references BacklinkDB and web graph extraction but provides no technical specification, format details, or usage documentation.
Unknown — insufficient data to compare against alternative graph extraction approaches
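Absent a documented format, here is a minimal sketch of the kind of backlink analysis this capability enables, assuming a plain tab-separated domain edge list; the file name and layout are hypothetical, not a BacklinkDB specification.

```python
# A hedged sketch of backlink analysis over a domain-level edge list.
# The tab-separated "source<TAB>target" layout and the file name are
# hypothetical, not a documented BacklinkDB format.
from collections import Counter

backlinks = Counter()
with open("domain-edges.tsv", encoding="utf-8") as fh:
    for line in fh:
        src, dst = line.rstrip("\n").split("\t")
        backlinks[dst] += 1  # one inbound edge = one backlink

# Domains with the most referring edges.
for domain, count in backlinks.most_common(10):
    print(f"{domain}\t{count}")
```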
raw warc file storage and s3-based distributed access
Medium confidence: Stores all crawled web content in WARC (Web ARChive) format on AWS S3 public buckets, enabling distributed access without centralized bottlenecks. WARC is the ISO 28500 standard for web archival, containing HTTP requests, responses, headers, and payloads in a sequential record format. S3 storage provides global availability, parallel download capability, and HTTP range request support for partial file retrieval. Users access files directly via S3 API or HTTP without intermediary services.
Uses standard ISO 28500 WARC format stored on public AWS S3 buckets, avoiding proprietary formats and enabling use of standard archive tools. This approach prioritizes interoperability and long-term preservation over convenience, allowing any tool that understands WARC to access the data without vendor lock-in.
More standardized and openly accessible than proprietary web crawl formats used by search engines or commercial data providers, and more durable than centralized APIs that could be deprecated
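Because WARC is an open standard, any WARC-aware tool can read the files. The sketch below uses the open-source warcio library, one common choice rather than an official requirement, to iterate over response records in a downloaded segment.

```python
# A minimal sketch using the open-source warcio library (one common
# choice, not an official requirement) to iterate response records in a
# downloaded WARC segment. The file path is illustrative.
from warcio.archiveiterator import ArchiveIterator

with open("segment.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            payload = record.content_stream().read()
            print(url, len(payload))
```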
robots.txt and opt-out registry compliance for crawl exclusion
Medium confidence: Implements crawl exclusion mechanisms respecting robots.txt directives and a maintained opt-out registry where domain owners can request exclusion from future crawls. The CCBot crawler checks robots.txt before crawling and consults the opt-out registry to avoid capturing content from domains that have requested exclusion. Provides a submission mechanism (details unknown) for domains to register opt-out requests.
Maintains an explicit opt-out registry separate from robots.txt, providing domain owners with a dedicated mechanism to request exclusion from future crawls. This dual-mechanism approach (robots.txt + registry) offers both technical and administrative control, though the registry submission process and enforcement details are not publicly documented.
More transparent than search engine crawlers regarding exclusion mechanisms, though less documented than robots.txt standard itself
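The client-side half of this contract can be illustrated with the Python standard library: checking whether a site's robots.txt would permit the CCBot user agent to fetch a given URL. The opt-out registry has no public API in the materials above, so it is not modeled here.

```python
# A sketch of the crawl-exclusion check from the client side, using only
# the standard library: would this robots.txt allow the CCBot user agent
# to fetch a given URL? The opt-out registry is not modeled (no public
# API is described in the documentation above).
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetch and parse the live robots.txt
print(parser.can_fetch("CCBot", "https://example.com/some/page"))
```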
hugging face integration for dataset discovery and download
Medium confidence: Provides integration with the Hugging Face Hub, enabling discovery and download of Common Crawl data through the Hugging Face ecosystem. Specific integration details, API format, and available datasets are unknown from the documentation. Allows researchers to access Common Crawl data through familiar Hugging Face tools and interfaces rather than direct S3 access.
Unknown — insufficient data. Documentation mentions Hugging Face integration exists but provides no technical specification, available datasets, or usage examples.
Unknown — insufficient data to compare against alternative integration approaches
community support and documentation via mailing list, discord, and faq
Medium confidence: Provides community support infrastructure including a mailing list archive, a Discord channel, and an FAQ section addressing common questions about data access, format, and usage. Enables peer-to-peer support and knowledge sharing among researchers and practitioners using Common Crawl. A blog with worked examples provides practical guidance on common tasks.
Operates as a non-profit with community-driven support model rather than commercial support tiers. Provides multiple communication channels (mailing list, Discord, FAQ, blog) enabling asynchronous and synchronous help, though without formal SLAs or guaranteed response times.
More accessible and community-oriented than commercial data providers, though less formal than enterprise support offerings
foundational data source for major language model training datasets
Medium confidence: Serves as the primary raw data source for downstream dataset creation pipelines including C4, The Pile, RedPajama, FineWeb, and Dolma. These datasets apply text extraction, deduplication, filtering, and quality curation on top of Common Crawl's raw WARC archives to produce cleaned, deduplicated text suitable for language model training. Common Crawl provides the petabyte-scale raw material; downstream projects handle cleaning and curation.
Provides the foundational petabyte-scale raw material for virtually every major open-source language model training dataset (C4, The Pile, RedPajama, FineWeb, Dolma). Unlike these downstream datasets, Common Crawl remains raw and unprocessed, allowing researchers to apply custom filtering and curation rather than being locked into pre-defined dataset compositions.
More comprehensive and openly accessible than proprietary web crawls used by commercial AI companies, and more flexible than pre-curated datasets that apply fixed filtering criteria
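As a toy illustration of the first step such pipelines perform on raw WARC payloads, the sketch below extracts plain text from an HTML document. Real pipelines add language identification, deduplication, and quality filtering; BeautifulSoup here is one common extractor, not necessarily what C4 or FineWeb use.

```python
# A toy illustration of the first downstream step on raw WARC payloads:
# extracting plain text from HTML. Real pipelines (C4, FineWeb, Dolma)
# add language ID, deduplication, and quality filtering; BeautifulSoup
# is one common extractor, not necessarily what those projects use.
from bs4 import BeautifulSoup

html = b"<html><body><h1>Title</h1><p>Some page text.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
text = " ".join(soup.get_text(separator=" ").split())  # collapse whitespace
print(text)  # -> Title Some page text.
```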
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with Common Crawl, ranked by overlap. Discovered automatically through the match graph.
FineWeb
Hugging Face's 15T token dataset, new standard for LLM training.
Firecrawl
API to turn websites into LLM-ready markdown — crawl, scrape, and map with JS rendering.
MINT-1T-PDF-CC-2023-50
Dataset by mlfoundations. 796,577 downloads.
You.com
A search engine built on AI that provides users with a customized search experience while keeping their data 100% private.
MINT-1T-PDF-CC-2023-06
Dataset by mlfoundations. 539,406 downloads.
c4
Dataset by allenai. 698,456 downloads.
Best For
- ✓ ML researchers training language models at scale
- ✓ Web archive researchers studying internet history
- ✓ Academic institutions with petabyte-scale processing infrastructure
- ✓ Non-profit organizations building open datasets
- ✓ Researchers needing specific pages from historical crawls
- ✓ Dataset builders filtering Common Crawl by URL patterns
- ✓ Web analysis tools requiring efficient page retrieval
- ✓ Teams building downstream datasets (C4, The Pile, RedPajama, FineWeb, Dolma)
Known Limitations
- ⚠ Monthly crawl frequency only; not real-time or continuous capture
- ⚠ Lag between capture and index availability is not documented
- ⚠ Limited to the publicly crawlable web; excludes authenticated, paywalled, and dynamically rendered content
- ⚠ No built-in JavaScript rendering; captures static HTML only
- ⚠ Raw WARC files contain unfiltered content, including spam, duplicates, and malicious pages
- ⚠ CDXJ query syntax and API are not documented in the provided materials
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Non-profit organization maintaining the largest open web crawl archive, containing petabytes of raw web data collected since 2008. Monthly crawls capture 3-5 billion web pages each. The foundational data source behind virtually every major language model training dataset including C4, The Pile, RedPajama, FineWeb, and Dolma. Stored on AWS S3 as WARC files with URL indices. Free to access but requires significant processing to extract clean text suitable for model training.
Alternatives to Common Crawl
Hugging Face
The GitHub for AI — 500K+ models, datasets, Spaces, Inference API, hub for open-source AI.