Code-Specific Data Extraction and Quality Filtering from The Stack

1. Dolma (Dataset, 58/100)

via “code-specific data extraction and quality filtering from the stack”

Allen AI's 3T token dataset for fully reproducible LLM training.

Unique: Dolma draws its code from The Stack and applies explicit license filtering (removing GPL-licensed code), which is intended to allow commercial use of code-trained models while staying compliant with open-source licenses. Many code datasets (e.g., CodeParrot), as well as the training data behind GitHub Copilot, do not document license filtering or offer GPL-free variants. Dolma also pairs license filtering with fuzzy deduplication across code repositories, which catches near-duplicates that exact-match deduplication misses.

vs others: Dolma's code data offers license-filtered training data without GPL restrictions, making it suitable for commercial models, whereas The Pile and other generic datasets either include GPL code or lack code data entirely. However, it is smaller and less frequently updated than GitHub's full code index.
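
The combination of license filtering and fuzzy deduplication described above can be sketched roughly as follows. This is a minimal illustration, not Dolma's actual pipeline: the record fields (`license`, `content`), the GPL prefix check, the 5-token shingles, and the 0.8 Jaccard threshold are all assumptions chosen for the example. Production pipelines typically use MinHash or similar to avoid the pairwise comparison shown here.

```python
import re

# Hypothetical deny-list: the listing only states that GPL code is removed.
BLOCKED_LICENSE_PREFIXES = ("gpl",)


def is_permitted(license_id: str) -> bool:
    """Return False when the license identifier starts with a blocked prefix."""
    return not license_id.lower().startswith(BLOCKED_LICENSE_PREFIXES)


def shingles(text: str, k: int = 5) -> set:
    """Split text into word tokens and return the set of k-token windows."""
    tokens = re.findall(r"\w+", text)
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def filter_and_dedupe(records: list, threshold: float = 0.8) -> list:
    """Drop blocked licenses, then drop files too similar to one already kept."""
    kept, kept_shingles = [], []
    for rec in records:
        if not is_permitted(rec["license"]):
            continue
        current = shingles(rec["content"])
        if any(jaccard(current, seen) >= threshold for seen in kept_shingles):
            continue  # near-duplicate of an earlier file
        kept.append(rec)
        kept_shingles.append(current)
    return kept


records = [
    {"path": "a.py", "license": "MIT",
     "content": "def add(a, b):\n    return a + b  # sum two numbers together"},
    {"path": "b.c", "license": "GPL-3.0",
     "content": "int main() { return 0; }"},
    {"path": "c.py", "license": "Apache-2.0",
     "content": "def add(a, b):\n\treturn a + b # sum two numbers together"},
    {"path": "d.py", "license": "MIT",
     "content": "class Stack:\n    def push(self, item):\n        self.items.append(item)"},
]
kept_paths = [rec["path"] for rec in filter_and_dedupe(records)]
```

Here `b.c` is dropped by the license check and `c.py` is dropped as a whitespace-only variant of `a.py`, which exact-match deduplication on raw bytes would not catch.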

2. StarCoderData (Dataset, 57/100)

via “quality filtering and code validity assessment”

250GB curated code dataset for StarCoder training.

Unique: Applies language-aware quality filtering (respecting syntax rules for each of 86 languages) rather than language-agnostic heuristics. Integrates license detection to ensure legal compliance, not just code quality.

vs others: More rigorous than CodeSearchNet (which uses simpler heuristics) and more transparent than the proprietary training data behind models like Codex (whose filtering criteria are not published). Balances quality with diversity better than hand-curated datasets.
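
To make the idea of language-aware filtering concrete, here is a small sketch that dispatches to a per-language validity check instead of applying one heuristic to every file. It is not StarCoderData's published filter: the validator table, the function names, and the 1000-character line limit are assumptions for illustration, and only two languages have validators here.

```python
import ast
import json


def _python_ok(src: str) -> bool:
    """A Python file is valid if it parses into an AST."""
    try:
        ast.parse(src)
        return True
    except (SyntaxError, ValueError):
        return False


def _json_ok(src: str) -> bool:
    """A JSON file is valid if it decodes without error."""
    try:
        json.loads(src)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical registry: one validator per language that has one.
VALIDATORS = {"python": _python_ok, "json": _json_ok}


def passes_quality_filter(src: str, language: str, max_line_len: int = 1000) -> bool:
    """Apply generic checks first, then the language-specific validator if any."""
    if not src.strip():
        return False  # empty or whitespace-only file
    if max(len(line) for line in src.splitlines()) > max_line_len:
        return False  # very long lines often indicate minified or generated code
    validator = VALIDATORS.get(language.lower())
    # Languages without a validator fall back to the generic checks only.
    return validator(src) if validator else True
```

For example, `passes_quality_filter("def f(:\n", "python")` is rejected by the Python parser, while the same text labeled as a language without a validator would pass only the generic checks.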

3. llm-vscode (Extension, 41/100)

via “code attribution checking via bloom filter matching against the stack dataset”

LLM powered development for VS Code

Unique: Integrates Bloom filter-based probabilistic matching against The Stack dataset directly into the VS Code editor workflow, providing real-time attribution checking without requiring external tools or manual searches. Acknowledges false positives transparently and links to detailed verification.

vs others: Provides training data attribution checking that GitHub Copilot does not expose, and integrates it directly into the editor rather than requiring separate tools like the Stack search interface.
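
The probabilistic matching mentioned above can be illustrated with a minimal Bloom filter. This sketch does not reflect how llm-vscode or The Stack's attribution service is implemented: the bit-array size, the number of hashes, the SHA-256-based hashing, and the whitespace normalization are all assumptions. It does show why a hit means "possibly in the dataset" (false positives can occur) while a miss means "definitely not indexed".

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over strings."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive several bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True only if every derived bit is set; unrelated items can collide.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def normalize(snippet: str) -> str:
    """Collapse whitespace so formatting differences do not hide a match."""
    return " ".join(snippet.split())


# Build the filter from (hypothetical) dataset snippets.
index = BloomFilter()
index.add(normalize("def add(a, b):\n    return a + b"))

generated = "def add(a, b):   return a + b"
possibly_in_dataset = normalize(generated) in index
```

A positive result should be treated as a prompt to verify against the dataset's search interface rather than as proof of a match.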
