Common Crawl
Dataset · Free — Largest open web crawl archive; the foundation of virtually all large-scale LLM training datasets.
Capabilities (12 decomposed)
petabyte-scale monthly web crawl ingestion and archival
Medium confidence — Operates a distributed web crawler (CCBot) that systematically traverses 3-5 billion web pages monthly, capturing raw HTML, metadata, and response headers into WARC (Web ARChive) format files stored on AWS S3. The crawl respects robots.txt directives and maintains an opt-out registry for content exclusion. Each monthly snapshot is immutable and indexed for retrieval, creating a cumulative archive of 300+ billion pages spanning 15+ years of web history.
Operates the largest open web crawl archive with 300+ billion pages spanning 15+ years, maintained as a non-profit public good with monthly refresh cycles and dual indexing (CDXJ + columnar) for both URL-based and structured queries. No commercial competitor maintains equivalent historical depth and scale.
More freely accessible for bulk programmatic use than other web archives (e.g., the Internet Archive's Wayback Machine), with explicit support for ML training pipelines and no rate-limiting for research use.
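Because each monthly snapshot is published as a fixed set of WARC files, a quick way to explore one is to pull its file listing. Below is a minimal sketch assuming the public HTTPS mirror at data.commoncrawl.org and the per-crawl warc.paths.gz listing convention; the crawl identifier is only an example.

```python
# Sketch: list the WARC files that make up one monthly snapshot.
# Assumes the data.commoncrawl.org mirror and the warc.paths.gz convention;
# CC-MAIN-2024-10 is an example crawl identifier.
import gzip
import urllib.request

CRAWL_ID = "CC-MAIN-2024-10"
BASE = "https://data.commoncrawl.org"

listing_url = f"{BASE}/crawl-data/{CRAWL_ID}/warc.paths.gz"
with urllib.request.urlopen(listing_url) as resp:
    paths = gzip.decompress(resp.read()).decode("utf-8").splitlines()

print(f"{len(paths)} WARC files in {CRAWL_ID}")
print("first file:", f"{BASE}/{paths[0]}")
```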
cdxj-indexed url-based retrieval from web archive
Medium confidence — Provides CDXJ (Capture inDeX JSON) indices that map URLs to byte offsets within WARC files, enabling direct random access to specific pages without scanning entire archives. Queries specify a URL and optional date range, returning matching captures with metadata (HTTP status, content type, timestamp). This index layer abstracts away WARC file complexity and enables efficient lookup of historical versions of individual pages.
Uses CDXJ standard (JSON-based capture index) rather than proprietary indexing, enabling interoperability with other web archive tools and allowing byte-offset-based random access to WARC files without full-file decompression. Supports both exact and wildcard URL matching.
More efficient than sequential WARC scanning for URL lookups and more standardized than Wayback Machine's custom index format, enabling third-party tool integration.
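As a hedged illustration of the index-then-range-read pattern described above, the sketch below queries the public per-crawl index server at index.commoncrawl.org and then fetches just one record's bytes from the underlying WARC file. The crawl identifier and target URL are examples, and the endpoint details are assumed from community documentation rather than this listing.

```python
# Sketch: CDX lookup for a URL, then a byte-range fetch of the matching WARC record.
import gzip
import json
import urllib.parse
import urllib.request

CRAWL_ID = "CC-MAIN-2024-10"  # example crawl identifier
query = urllib.parse.urlencode({"url": "example.com/", "output": "json"})
index_url = f"https://index.commoncrawl.org/{CRAWL_ID}-index?{query}"

with urllib.request.urlopen(index_url) as resp:
    captures = [json.loads(line) for line in resp.read().decode().splitlines()]

cap = captures[0]  # each capture records the WARC filename, byte offset, and length
offset, length = int(cap["offset"]), int(cap["length"])
record_url = f"https://data.commoncrawl.org/{cap['filename']}"
req = urllib.request.Request(
    record_url, headers={"Range": f"bytes={offset}-{offset + length - 1}"})
with urllib.request.urlopen(req) as resp:
    record = gzip.decompress(resp.read())  # one WARC record: WARC + HTTP headers + body

print(record[:300].decode("utf-8", "replace"))
```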
infrastructure status monitoring and errata tracking
Medium confidence — Publishes infrastructure status updates, known issues, and errata for crawls through a public status page and mailing list. Issues are documented with affected crawls, impact assessment, and workarounds. Status monitoring includes S3 availability, index health, and crawl progress. Errata tracking enables users to identify and work around data quality issues in specific crawls.
Maintains public errata tracking and status monitoring for crawls, enabling users to identify and work around data quality issues. Combines status page, mailing list, and documentation for transparency.
More transparent than proprietary data sources; public errata tracking enables community awareness of issues, whereas most competitors provide no visibility into data quality problems.
ccbot crawler with configurable crawl parameters
Medium confidence — Operates a distributed web crawler (CCBot) that can be configured with custom crawl parameters including politeness delays, user-agent strings, robots.txt interpretation, and domain-specific crawl budgets. The crawler respects HTTP standards and robots.txt directives, with configurable behavior for handling redirects, timeouts, and errors. Crawl parameters are documented for each monthly release, enabling reproducibility and evaluation of crawl quality.
Publishes crawl parameters and methodology for each monthly release, enabling reproducibility and evaluation of crawl quality. Crawler respects HTTP standards and robots.txt, with documented politeness policies.
More transparent about crawl methodology than proprietary crawlers; published parameters enable reproducibility and comparison with other crawling approaches.
columnar-indexed structured query access to web archive metadata
Medium confidence — Provides columnar indices (format and query syntax unspecified in documentation) that enable structured queries across archive metadata without parsing WARC files. Queries can filter by domain, content-type, HTTP status, crawl date, and other fields, returning matching page metadata and offsets. This approach trades random-access flexibility for efficient bulk filtering and aggregation across billions of pages.
Uses columnar storage (likely Parquet or similar) for metadata indices, enabling efficient filtering and aggregation across billions of pages without decompressing WARC files. Supports multi-field queries and bulk statistics generation.
More efficient than CDXJ for bulk filtering and aggregation queries; enables data engineers to pre-filter before WARC parsing, reducing downstream processing costs.
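A sketch of the bulk-filtering pattern, assuming the index is published as a public Parquet table under s3://commoncrawl/cc-index/table/cc-main/warc/ with the partition layout and column names shown below; these are assumptions drawn from community usage, not confirmed by the documentation here.

```python
# Sketch: structured metadata query over the columnar index with pyarrow + s3fs.
# The S3 path, partition names, and column names are assumed, not authoritative.
# pip install pyarrow s3fs
import pyarrow.dataset as ds
import s3fs

fs = s3fs.S3FileSystem(anon=True)  # the bucket is expected to allow anonymous reads
dataset = ds.dataset(
    "commoncrawl/cc-index/table/cc-main/warc/"
    "crawl=CC-MAIN-2024-10/subset=warc/",   # one crawl partition (example ID)
    filesystem=fs,
    format="parquet",
)

# Filter to one host and keep only the fields needed to locate WARC records later.
table = dataset.to_table(
    columns=["url", "fetch_status", "warc_filename",
             "warc_record_offset", "warc_record_length"],
    filter=ds.field("url_host_name") == "example.com",
)
print(table.num_rows, "captures for example.com")
```

Even with column pruning, a host-level predicate still touches most files in a crawl partition, so in practice such queries are usually run with an engine close to the data (Athena- or Spark-style) rather than from a laptop.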
web graph extraction and backlink relationship analysis
Medium confidence — Extracts hyperlink relationships from crawled pages to construct a directed web graph showing which pages link to which other pages. This graph data is provided separately from raw page content, enabling analysis of link structure, PageRank-like metrics, and domain authority without parsing HTML. The extraction process identifies both internal (same-domain) and external (cross-domain) links.
Extracts hyperlink graph from petabyte-scale web crawl, providing researchers with a snapshot of global web topology at monthly intervals. Graph data is separated from content, enabling efficient analysis without parsing HTML.
Larger and more recent than academic web graph datasets (e.g., WebGraph, SNAP); freely available and updated monthly, whereas most academic graphs are static or years old.
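To make the backlink-analysis use case concrete, here is a toy sketch over an edge list. The file name and the one-edge-per-line format are hypothetical; the real host- and domain-level graph releases are far larger and split across many files.

```python
# Toy sketch of backlink analysis over an extracted edge list ("source target" per line).
# pip install networkx
import networkx as nx

g = nx.DiGraph()
with open("host_graph_edges.txt") as f:   # hypothetical local edge-list file
    for line in f:
        src, dst = line.split()
        g.add_edge(src, dst)

# PageRank-style authority scores and simple backlink counts per host.
scores = nx.pagerank(g, alpha=0.85)
backlinks = {node: g.in_degree(node) for node in g}
top = sorted(scores, key=scores.get, reverse=True)[:10]
for host in top:
    print(host, round(scores[host], 5), backlinks[host])
```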
historical web snapshot retrieval across 15-year archive
Medium confidence — Enables retrieval of any page version from the cumulative 300+ billion page archive spanning 2008-present, with monthly granularity. Users specify a URL and date range, and the system returns all captures of that page from matching crawls. This creates a time-series view of how individual pages evolved, including content changes, design updates, and deletion/resurrection events.
Maintains 15+ years of monthly web snapshots (300+ billion pages cumulative), enabling fine-grained temporal analysis of web content evolution. No commercial competitor offers equivalent historical depth at this scale.
Better suited than the Internet Archive's Wayback Machine for bulk historical analysis; free and designed for programmatic access rather than interactive browsing.
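A sketch of building a per-URL time series across crawls, assuming the public collection listing at index.commoncrawl.org/collinfo.json and the per-crawl CDX endpoints it advertises (both are assumptions from community documentation; the target URL is an example):

```python
# Sketch: gather every capture of one URL across several monthly crawl indexes.
import json
import urllib.error
import urllib.parse
import urllib.request

with urllib.request.urlopen("https://index.commoncrawl.org/collinfo.json") as resp:
    collections = json.load(resp)  # one entry per monthly crawl index

target = urllib.parse.urlencode({"url": "example.com/", "output": "json"})
history = []
for coll in collections[:6]:  # limit to a handful of crawls for the demo
    try:
        with urllib.request.urlopen(f"{coll['cdx-api']}?{target}") as resp:
            for line in resp.read().decode().splitlines():
                cap = json.loads(line)
                history.append((cap["timestamp"], cap["status"], cap["digest"]))
    except urllib.error.HTTPError:
        pass  # URL not captured in this crawl

for ts, status, digest in sorted(history):
    print(ts, status, digest[:12])  # a changed digest signals changed page content
```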
warc format raw data export with http headers and metadata
Medium confidence — Exports raw web content in WARC (Web ARChive) format, a standardized container that bundles HTTP request/response pairs with metadata. Each WARC record includes the original HTTP status code, headers, response body (HTML, JSON, binary), and crawl metadata (timestamp, IP address, user-agent). WARC files are gzip-compressed and stored on S3, with indices enabling random access to specific records without decompressing entire files.
Uses WARC standard format (ISO 28500) rather than proprietary encoding, ensuring long-term preservation and interoperability with other archival tools. Stores on AWS S3 with public access, enabling direct programmatic access without intermediary APIs.
More standardized and preservation-friendly than custom formats; larger and more recent than academic web corpora; free and designed for large-scale processing rather than interactive access.
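For local processing, the community warcio library (not part of Common Crawl itself) iterates records without loading whole files into memory. A minimal sketch, assuming a WARC file has already been downloaded; the filename is illustrative.

```python
# Sketch: iterate response records in a gzip-compressed WARC file with warcio.
# pip install warcio
from warcio.archiveiterator import ArchiveIterator

with open("CC-MAIN-example.warc.gz", "rb") as stream:   # hypothetical local file
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue  # skip request and metadata records
        url = record.rec_headers.get_header("WARC-Target-URI")
        status = record.http_headers.get_statuscode()
        ctype = record.http_headers.get_header("Content-Type", "")
        if "text/html" in ctype:
            body = record.content_stream().read()       # raw HTML bytes
            print(status, url, len(body))
```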
monthly crawl release coordination and versioning
Medium confidence — Publishes monthly snapshots of the web crawl on a documented schedule, with each release including 2-5 billion pages, comprehensive statistics (page counts, size, coverage by domain/TLD), and known issues (errata). Each crawl is assigned a unique identifier and published with metadata enabling reproducible research. The release process includes documentation of crawl parameters (user-agent, politeness delays, robots.txt compliance) and known limitations.
Publishes monthly crawl snapshots with comprehensive statistics and errata tracking, enabling reproducible research and version-pinning. Each crawl is immutable and independently documented, supporting long-term archival and citation.
More transparent and reproducible than proprietary web data sources; monthly releases enable tracking of web evolution, whereas most competitors provide static or infrequently-updated snapshots.
robots.txt and opt-out registry compliance enforcement
Medium confidence — Respects robots.txt directives and maintains an opt-out registry allowing content creators to exclude their sites from crawling and archival. The CCBot crawler checks robots.txt before crawling each domain and honors disallow rules. Additionally, a public opt-out registry enables site owners to request retroactive removal from the archive. Compliance is enforced at crawl time (robots.txt) and archive time (opt-out registry).
Maintains a public opt-out registry and enforces robots.txt compliance at crawl time, providing content creators with control over archival. Combines crawler-side (robots.txt) and archive-side (opt-out registry) mechanisms.
More transparent and creator-friendly than proprietary web archives; explicit opt-out mechanism enables content removal, whereas some competitors provide no removal option.
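The crawler-side check can be reproduced with the standard library: ask a site's robots.txt whether the CCBot user-agent may fetch a page. A minimal sketch with an example URL:

```python
# Sketch: check whether a site's robots.txt allows CCBot to fetch a given page.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# CCBot is the user-agent string Common Crawl's crawler identifies itself with.
allowed = rp.can_fetch("CCBot", "https://example.com/some/page.html")
print("CCBot may fetch:", allowed)
```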
hugging face integration and dataset export
Medium confidence — Provides integration with Hugging Face Hub, enabling researchers to access Common Crawl data through the Hugging Face datasets library and export processed datasets directly to the Hub. This integration abstracts away S3 access complexity and enables one-line dataset loading in Python. Processed datasets (C4, The Pile, RedPajama, FineWeb, Dolma) are published on the Hub with documentation and usage examples.
Integrates with Hugging Face Hub to provide one-line dataset loading for Common Crawl-derived datasets, abstracting away S3 access and WARC parsing. Enables community dataset sharing and discovery.
Simpler than direct S3 access for Python users; enables dataset discovery and comparison across multiple processing pipelines (C4, The Pile, RedPajama, FineWeb, Dolma).
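A minimal sketch of the one-line loading path via the datasets library, using FineWeb's published sample config as an example; the dataset ID, config name, and field names are assumptions to adapt to whichever derived dataset you need.

```python
# Sketch: stream a Common Crawl-derived dataset from the Hugging Face Hub.
# Dataset ID, config name, and record fields below are assumed examples.
# pip install datasets
from datasets import load_dataset

fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                       split="train", streaming=True)

for i, doc in enumerate(fineweb):
    print(doc["url"], len(doc["text"]))  # assumed field names for FineWeb records
    if i >= 4:
        break
```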
community-maintained extraction and processing pipelines
Medium confidence — Enables third-party researchers and organizations to build and publish extraction pipelines that transform raw Common Crawl WARC data into clean, deduplicated, filtered datasets suitable for model training. Major pipelines (C4, The Pile, RedPajama, FineWeb, Dolma) are published with open-source code, documentation, and reproducible builds. These pipelines handle deduplication, language filtering, quality scoring, and format conversion.
Enables community-driven extraction pipelines with published code and documentation, creating a transparent ecosystem of dataset processing approaches. Major pipelines (C4, The Pile, RedPajama, FineWeb, Dolma) are open-source and reproducible.
More transparent and reproducible than proprietary dataset processing; enables community contribution and comparison of different approaches, whereas most commercial datasets are black-box.
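As a toy illustration of what such pipelines do, the sketch below applies exact hash deduplication and a crude ASCII-ratio filter as a stand-in for real language and quality filtering. Real pipelines (C4, FineWeb, Dolma) use fuzzy dedup and trained classifiers; nothing here reflects their actual code.

```python
# Toy sketch: exact dedup by content hash plus a crude quality/language proxy.
import hashlib

def process(documents):
    seen = set()
    for doc in documents:
        text = doc["text"].strip()
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate: drop
        seen.add(digest)
        ascii_ratio = sum(c.isascii() for c in text) / max(len(text), 1)
        if ascii_ratio < 0.9:
            continue  # crude stand-in for a language / quality classifier
        yield {"text": text, "sha1": digest}

docs = [
    {"text": "Hello web."},
    {"text": "Hello web."},              # dropped as an exact duplicate
    {"text": "Привет, мир, это тест"},   # dropped by the ASCII-ratio heuristic
]
print(list(process(docs)))
```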
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts — sharing capabilities
Artifacts that share capabilities with Common Crawl, ranked by overlap. Discovered automatically through the match graph.
RedPajama v2
30 trillion token web dataset with 40+ quality signals per document.
mcp-wayback-machine
Save any URL to the Internet Archive and retrieve archived snapshots on demand. Search captures by date and get capture counts and history for any site. Preserve and audit web content without managing API keys.
You.com
A search engine built on AI that provides users with a customized search experience while keeping their data 100% private.
Tavily API
Search API for AI agents — clean web content, answer extraction, designed for RAG and LLM apps.
PimEyes
Explore digital footprints with AI-driven facial recognition...
Hotbot
HotBot is an AI-powered search engine that provides users with fast and personalized search results....
Best For
- ✓ML/NLP researchers building large-scale training datasets (C4, The Pile, RedPajama, FineWeb, Dolma all depend on this)
- ✓Web historians and researchers studying internet evolution
- ✓Organizations needing compliance-friendly web archives with documented crawl dates
- ✓Researchers studying specific websites or domains over time
- ✓Data engineers building incremental extraction pipelines (query by URL, fetch only changed pages)
- ✓Web archivists and historians tracking individual page evolution
- ✓Data engineers building production pipelines that depend on Common Crawl availability
- ✓Researchers requiring high-quality data and wanting to avoid problematic crawls
Known Limitations
- ⚠Raw WARC format requires specialized parsing tools; no built-in text extraction API
- ⚠Crawl frequency is monthly, not real-time; latest data is 1-4 weeks old
- ⚠Content respects robots.txt and opt-out registry, so paywalled, authenticated, or excluded sites are missing
- ⚠No deduplication or quality filtering applied at crawl time; downstream processing required to remove spam, malformed HTML, and duplicates
- ⚠Bias toward crawlable, English-language, and publicly indexable content; non-English and dynamic content underrepresented
- ⚠CDXJ query syntax and API endpoint details not documented in provided materials; requires reverse-engineering or community documentation
About
Non-profit organization maintaining the largest open web crawl archive, containing petabytes of raw web data collected since 2008. Monthly crawls capture 3-5 billion web pages each. The foundational data source behind virtually every major language model training dataset including C4, The Pile, RedPajama, FineWeb, and Dolma. Stored on AWS S3 as WARC files with URL indices. Free to access but requires significant processing to extract clean text suitable for model training.