Common Crawl
Dataset · Free
Largest open web crawl archive and the foundation of virtually every major LLM training dataset.
Capabilities (9 decomposed)
petabyte-scale monthly web crawl capture with warc archival
Medium confidence: Executes monthly crawl cycles capturing 3-5 billion web pages using the CCBot crawler agent, storing raw HTTP responses, headers, and page content in WARC (Web ARChive) format on AWS S3. Respects robots.txt and maintains an opt-out registry to exclude domains from crawling. Each monthly snapshot becomes a permanent archive layer, accumulating 300+ billion pages across 15+ years of operation.
Operates as a non-profit public infrastructure project with 15+ years of continuous monthly crawls stored in standard WARC format, making it the largest open web archive. Unlike commercial crawlers, Common Crawl publishes entire monthly snapshots as immutable archives rather than incremental updates, enabling reproducible research across time periods.
More freely accessible in bulk than the Wayback Machine (which focuses on preserving and replaying specific URLs), and more standardized than the proprietary web crawl datasets used by search engines or AI companies
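As a quick orientation, the sketch below enumerates the available monthly snapshots. It assumes the public collection-info endpoint (collinfo.json) served by the index server at index.commoncrawl.org; the endpoint and field names follow that API and are not specified in the documentation above.

```python
# A minimal sketch: list monthly crawl snapshots via the public
# collection-info endpoint. Endpoint and field names are assumptions
# based on the index server at index.commoncrawl.org.
import requests

resp = requests.get("https://index.commoncrawl.org/collinfo.json", timeout=30)
resp.raise_for_status()
for crawl in resp.json():
    # Each entry describes one monthly snapshot, e.g. CC-MAIN-2023-50.
    print(crawl.get("id"), crawl.get("name"))
```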
cdxj index-based url lookup and warc file location resolution
Medium confidence: Provides CDXJ (Capture inDeX JSON) indices that map URLs to their locations within WARC files, enabling random access to specific crawled pages without scanning entire archives. The index structure stores URL metadata and WARC file offsets, allowing efficient retrieval of individual pages from petabyte-scale datasets. Users query the index to locate a URL, then fetch only the relevant WARC segment from S3.
Uses CDXJ (JSON-based capture index) format for URL-to-WARC mapping, enabling O(log n) lookup instead of linear WARC scanning. This approach allows researchers to retrieve individual pages from petabyte archives without downloading entire monthly snapshots, making Common Crawl accessible to resource-constrained teams.
More efficient than downloading full WARC files and more standardized than proprietary index formats used by commercial web archives
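To make the lookup flow concrete, here is a hedged sketch of the two-step pattern described above: query the index for a URL, then fetch only the matching byte range of the WARC file. The crawl id is an example, and the hostnames and field names follow the public CDX index API rather than anything specified in the documentation above.

```python
# A hedged sketch of index-based retrieval: look up a URL in the public
# CDX index API, then fetch only the matching WARC segment with an HTTP
# Range request. CC-MAIN-2023-50 is an example crawl id.
import gzip
import json

import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"
resp = requests.get(
    INDEX,
    params={"url": "example.com", "output": "json", "limit": "1"},
    timeout=60,
)
resp.raise_for_status()
record = json.loads(resp.text.splitlines()[0])  # first capture row

# The index row carries the WARC filename, byte offset, and record length.
offset, length = int(record["offset"]), int(record["length"])
warc_url = "https://data.commoncrawl.org/" + record["filename"]
headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
segment = requests.get(warc_url, headers=headers, timeout=60).content

# Each segment is an independently gzipped WARC record.
print(gzip.decompress(segment).decode("utf-8", errors="replace")[:400])
```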
columnar index for metadata-based filtering and analytics
Medium confidence: Provides a columnar index structure (format and technical details unknown from documentation) that enables efficient filtering and aggregation across crawl metadata without accessing raw WARC content. Allows queries on metadata dimensions like domain, content type, HTTP status codes, and capture timestamps. Designed for analytical workloads that need statistics or filtered subsets of the crawl without full content retrieval.
Unknown — insufficient data. Documentation mentions columnar index existence but provides no technical specification, query interface, or usage examples.
Unknown — insufficient data to compare against alternative indexing approaches
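Since the documentation confirms only that a columnar index exists, the following is a sketch under assumptions: if the index is available as (or exported to) Parquet, an embedded engine such as DuckDB could run the metadata filters described above. The file name and column names are illustrative, not confirmed.

```python
# A sketch under assumptions: format of the columnar index is not
# documented above. If Parquet, DuckDB can filter crawl metadata without
# touching raw WARC content. File and column names are illustrative.
import duckdb

rows = duckdb.sql(
    """
    SELECT url, content_mime_type, fetch_status
    FROM read_parquet('cc-index-sample.parquet')  -- hypothetical local export
    WHERE fetch_status = 200
      AND content_mime_type = 'text/html'
    LIMIT 10
    """
).fetchall()
print(rows)
```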
web graph extraction with domain-level link structure and backlink analysis
Medium confidence: Extracts a domain-level link graph from crawl data, capturing which domains link to which other domains and their backlink relationships. Produces graph data (format unknown) representing the web's connectivity structure. Enables analysis of domain authority, link patterns, and web topology without processing raw page content. Referenced as 'BacklinkDB' in the documentation, but technical details are not provided.
Unknown — insufficient data. Documentation references BacklinkDB and web graph extraction but provides no technical specification, format details, or usage documentation.
Unknown — insufficient data to compare against alternative graph extraction approaches
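Absent a documented format, here is a minimal sketch of the kind of backlink analysis this capability enables, assuming a plain tab-separated domain edge list; the file name and layout are hypothetical, not a BacklinkDB specification.

```python
# A hedged sketch of backlink analysis over a domain-level edge list.
# The tab-separated "source<TAB>target" layout and the file name are
# hypothetical, not a documented BacklinkDB format.
from collections import Counter

backlinks = Counter()
with open("domain-edges.tsv", encoding="utf-8") as fh:
    for line in fh:
        src, dst = line.rstrip("\n").split("\t")
        backlinks[dst] += 1  # one inbound edge = one backlink

# Domains with the most referring edges.
for domain, count in backlinks.most_common(10):
    print(f"{domain}\t{count}")
```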
raw warc file storage and s3-based distributed access
Medium confidence: Stores all crawled web content in WARC (Web ARChive) format on AWS S3 public buckets, enabling distributed access without centralized bottlenecks. WARC is the ISO 28500 standard for web archival, containing HTTP requests, responses, headers, and payloads in a sequential record format. S3 storage provides global availability, parallel download capability, and HTTP range request support for partial file retrieval. Users access files directly via S3 API or HTTP without intermediary services.
Uses standard ISO 28500 WARC format stored on public AWS S3 buckets, avoiding proprietary formats and enabling use of standard archive tools. This approach prioritizes interoperability and long-term preservation over convenience, allowing any tool that understands WARC to access the data without vendor lock-in.
More standardized and openly accessible than proprietary web crawl formats used by search engines or commercial data providers, and more durable than centralized APIs that could be deprecated
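Because WARC is an open standard, any WARC-aware tool can read the files. The sketch below uses the open-source warcio library, one common choice rather than an official requirement, to iterate over response records in a downloaded segment.

```python
# A minimal sketch using the open-source warcio library (one common
# choice, not an official requirement) to iterate response records in a
# downloaded WARC segment. The file path is illustrative.
from warcio.archiveiterator import ArchiveIterator

with open("segment.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            payload = record.content_stream().read()
            print(url, len(payload))
```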
robots.txt and opt-out registry compliance for crawl exclusion
Medium confidence: Implements crawl exclusion mechanisms respecting robots.txt directives and a maintained opt-out registry where domain owners can request exclusion from future crawls. The CCBot crawler checks robots.txt before crawling and consults the opt-out registry to avoid capturing content from domains that have requested exclusion. Provides a submission mechanism (details unknown) for domains to register opt-out requests.
Maintains an explicit opt-out registry separate from robots.txt, providing domain owners with a dedicated mechanism to request exclusion from future crawls. This dual-mechanism approach (robots.txt + registry) offers both technical and administrative control, though the registry submission process and enforcement details are not publicly documented.
More transparent than search engine crawlers regarding exclusion mechanisms, though less documented than robots.txt standard itself
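The client-side half of this contract can be illustrated with the Python standard library: checking whether a site's robots.txt would permit the CCBot user agent to fetch a given URL. The opt-out registry has no public API in the materials above, so it is not modeled here.

```python
# A sketch of the crawl-exclusion check from the client side, using only
# the standard library: would this robots.txt allow the CCBot user agent
# to fetch a given URL? The opt-out registry is not modeled (no public
# API is described in the documentation above).
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetch and parse the live robots.txt
print(parser.can_fetch("CCBot", "https://example.com/some/page"))
```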
hugging face integration for dataset discovery and download
Medium confidence: Provides integration with the Hugging Face Hub, enabling discovery and download of Common Crawl data through the Hugging Face ecosystem. Specific integration details, API format, and available datasets are unknown from the documentation. Allows researchers to access Common Crawl data through familiar Hugging Face tools and interfaces rather than direct S3 access.
Unknown — insufficient data. Documentation mentions Hugging Face integration exists but provides no technical specification, available datasets, or usage examples.
Unknown — insufficient data to compare against alternative integration approaches
community support and documentation via mailing list, discord, and faq
Medium confidence: Provides community support infrastructure including a mailing list archive, a Discord channel, and an FAQ section addressing common questions about data access, format, and usage. Enables peer-to-peer support and knowledge sharing among researchers and practitioners using Common Crawl. A blog with worked examples provides practical guidance on common tasks.
Operates as a non-profit with community-driven support model rather than commercial support tiers. Provides multiple communication channels (mailing list, Discord, FAQ, blog) enabling asynchronous and synchronous help, though without formal SLAs or guaranteed response times.
More accessible and community-oriented than commercial data providers, though less formal than enterprise support offerings
foundational data source for major language model training datasets
Medium confidence: Serves as the primary raw data source for downstream dataset creation pipelines including C4, The Pile, RedPajama, FineWeb, and Dolma. These datasets apply text extraction, deduplication, filtering, and quality curation on top of Common Crawl's raw WARC archives to produce cleaned, deduplicated text suitable for language model training. Common Crawl provides the petabyte-scale raw material; downstream projects handle cleaning and curation.
Provides the foundational petabyte-scale raw material for virtually every major open-source language model training dataset (C4, The Pile, RedPajama, FineWeb, Dolma). Unlike these downstream datasets, Common Crawl remains raw and unprocessed, allowing researchers to apply custom filtering and curation rather than being locked into pre-defined dataset compositions.
More comprehensive and openly accessible than proprietary web crawls used by commercial AI companies, and more flexible than pre-curated datasets that apply fixed filtering criteria
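As a toy illustration of the first step such pipelines perform on raw WARC payloads, the sketch below extracts plain text from an HTML document. Real pipelines add language identification, deduplication, and quality filtering; BeautifulSoup here is one common extractor, not necessarily what C4 or FineWeb use.

```python
# A toy illustration of the first downstream step on raw WARC payloads:
# extracting plain text from HTML. Real pipelines (C4, FineWeb, Dolma)
# add language ID, deduplication, and quality filtering; BeautifulSoup
# is one common extractor, not necessarily what those projects use.
from bs4 import BeautifulSoup

html = b"<html><body><h1>Title</h1><p>Some page text.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
text = " ".join(soup.get_text(separator=" ").split())  # collapse whitespace
print(text)  # -> Title Some page text.
```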
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with Common Crawl, ranked by overlap. Discovered automatically through the match graph.
FineWeb
Hugging Face's 15T token dataset, new standard for LLM training.
Firecrawl
API to turn websites into LLM-ready markdown — crawl, scrape, and map with JS rendering.
MINT-1T-PDF-CC-2023-50
Dataset by mlfoundations. 796,577 downloads.
You.com
A search engine built on AI that provides users with a customized search experience while keeping their data 100% private.
MINT-1T-PDF-CC-2023-06
Dataset by mlfoundations. 539,406 downloads.
c4
Dataset by allenai. 698,456 downloads.
Best For
- ✓ ML researchers training language models at scale
- ✓ Web archive researchers studying internet history
- ✓ Academic institutions with petabyte-scale processing infrastructure
- ✓ Non-profit organizations building open datasets
- ✓ Researchers needing specific pages from historical crawls
- ✓ Dataset builders filtering Common Crawl by URL patterns
- ✓ Web analysis tools requiring efficient page retrieval
- ✓ Teams building downstream datasets (C4, The Pile, RedPajama, FineWeb, Dolma)
Known Limitations
- ⚠ Monthly crawl frequency only; not real-time or continuous capture
- ⚠ Lag between capture and index availability is not documented
- ⚠ Limited to the publicly crawlable web; excludes authenticated, paywalled, and dynamically rendered content
- ⚠ No built-in JavaScript rendering; captures static HTML only
- ⚠ Raw WARC files contain unfiltered content, including spam, duplicates, and malicious pages
- ⚠ CDXJ query syntax and API are not documented in the provided materials
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Non-profit organization maintaining the largest open web crawl archive, containing petabytes of raw web data collected since 2008. Monthly crawls capture 3-5 billion web pages each. The foundational data source behind virtually every major language model training dataset including C4, The Pile, RedPajama, FineWeb, and Dolma. Stored on AWS S3 as WARC files with URL indices. Free to access but requires significant processing to extract clean text suitable for model training.
Alternatives to Common Crawl
Hugging Face
The GitHub for AI — 500K+ models, datasets, Spaces, Inference API, hub for open-source AI.