Capability
3 artifacts provide this capability.
via “code-specific data extraction and quality filtering from the stack”
Allen AI's 3T token dataset for fully reproducible LLM training.
Unique: Dolma integrates The Stack with explicit license filtering (removing GPL code), which enables commercial use of code-trained models while maintaining open-source compliance. Most code datasets (e.g., CodeParrot, or the undisclosed data used to train GitHub Copilot) do not document license filtering or provide GPL-free variants. Dolma also pairs license filtering with fuzzy deduplication across code repositories, which catches near-identical files that simple exact-match deduplication misses.
vs others: Dolma's code data provides license-compliant code training without GPL restrictions, making it suitable for commercial models, whereas The Pile and other generic datasets either include GPL code or lack code data entirely. However, it is smaller and less frequently updated than GitHub's full code index.
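To make the combination of license filtering and fuzzy deduplication concrete, here is a minimal Python sketch. It is not Dolma's pipeline: the `PERMISSIVE` allowlist, the record fields (`path`, `license`, `content`), the 5-token shingles, and the 0.8 threshold are all assumptions chosen for illustration. It also compares files with exact Jaccard similarity, a small-scale stand-in for the MinHash-style approximations that large pipelines typically use.

```python
import re

# Assumption: an illustrative allowlist, not Dolma's actual license list.
PERMISSIVE = {"mit", "apache-2.0", "bsd-3-clause"}


def shingles(text, k=5):
    """Return the set of k-token shingles for a source file."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def filter_and_dedup(files, threshold=0.8):
    """Keep permissively licensed files and drop near-duplicates of kept ones."""
    kept, kept_shingles = [], []
    for f in files:
        # License filter: skip anything outside the allowlist (e.g. GPL).
        if f["license"].lower() not in PERMISSIVE:
            continue
        # Fuzzy dedup: skip files too similar to one already kept.
        s = shingles(f["content"])
        if any(jaccard(s, other) >= threshold for other in kept_shingles):
            continue
        kept.append(f)
        kept_shingles.append(s)
    return kept
```

Because shingles are built from tokens rather than raw bytes, two files that differ only in whitespace count as duplicates here, which a byte-for-byte hash comparison would not detect.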
via “quality filtering and code validity assessment”
250GB curated code dataset for StarCoder training.
Unique: Applies language-aware quality filtering (respecting syntax rules for each of 86 languages) rather than language-agnostic heuristics. Integrates license detection to ensure legal compliance, not just code quality.
vs others: More rigorous than CodeSearchNet (which uses simpler heuristics) and more transparent than the proprietary training data behind Codex (whose filtering criteria are unpublished). Balances quality with diversity better than hand-curated datasets.
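The difference between language-aware and language-agnostic filtering can be sketched in a few lines of Python. This is an illustration only, not the dataset's actual filter: the `VALIDATORS` registry, the `passes_quality_filter` function, and the 1000-character line limit are hypothetical, and only two languages are covered using parsers from the standard library.

```python
import ast
import json


def is_valid_python(source):
    """Language-aware check: does the file parse as Python?"""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def is_valid_json(source):
    """Language-aware check: does the file parse as JSON?"""
    try:
        json.loads(source)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


# Hypothetical registry mapping a language name to its syntax check.
VALIDATORS = {"python": is_valid_python, "json": is_valid_json}


def passes_quality_filter(language, source, max_line_length=1000):
    """Apply a language-agnostic heuristic, then a per-language check if one exists."""
    # Language-agnostic heuristic: very long lines often indicate generated or minified files.
    if any(len(line) > max_line_length for line in source.splitlines()):
        return False
    validator = VALIDATORS.get(language)
    # Languages without a registered validator fall back to the heuristic alone.
    return validator(source) if validator else True
```

For example, `passes_quality_filter("python", "def f(:\n")` returns `False` because the file fails to parse, even though it would pass the line-length heuristic on its own.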
via “code attribution checking via bloom filter matching against the stack dataset”
LLM powered development for VS Code
Unique: Integrates Bloom filter-based probabilistic matching against The Stack dataset directly into the VS Code editor workflow, providing real-time attribution checking without requiring external tools or manual searches. Acknowledges false positives transparently and links to detailed verification.
vs others: Provides training data attribution checking that GitHub Copilot does not expose, and integrates it directly into the editor rather than requiring separate tools like the Stack search interface.
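The probabilistic matching described above, including why false positives can occur, can be illustrated with a small Bloom filter in Python. This is a generic sketch and not the extension's implementation: the class, the bit-array size, the number of hash functions, the SHA-256-based hashing, and the `normalize` helper are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """A fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def normalize(snippet):
    """Collapse whitespace so formatting differences do not hide a match."""
    return " ".join(snippet.split())


# Usage: index known snippets, then test generated code against the filter.
index = BloomFilter()
index.add(normalize("def add(a, b):\n    return a + b"))
generated = "def add(a, b):   return a + b"
possibly_seen = normalize(generated) in index
```

A positive result only means the snippet may appear in the indexed data, since unrelated items can happen to set the same bits. That is why a hit is best treated as a prompt for exact verification rather than as proof of attribution.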
© 2026 Unfragile. The platform for software for agents.