Capability
Multi-Language Code Corpus Assembly with Permissive Licensing Verification
2 artifacts provide this capability.
Top Matches
via “multi-language code corpus assembly with permissive licensing verification”
783 GB curated code dataset from 86 languages with PII redaction.
Unique: Applies an explicit permissive-only license filter with SPDX validation at collection time, combined with an opt-out mechanism for developers. Most competing datasets (CodeSearchNet, GitHub-Code) lack a developer opt-out and include mixed licensing.
vs others: Legally cleaner than CodeSearchNet (mixed GPL/proprietary) and more developer-respectful than GitHub-Code (no opt-out), making it safer for commercial model training
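
To make the permissive-only filter and the developer opt-out concrete, here is a minimal sketch of how such a collection-time check might look. It is not the dataset's actual pipeline: the allowlist contents, the RepoRecord type, and the function names are all hypothetical, chosen only for illustration. The sketch assumes each repository already carries a detected SPDX identifier (or none).

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set

# Hypothetical allowlist: a few SPDX identifiers commonly treated as
# permissive. The real dataset's list is not given in the source.
PERMISSIVE_SPDX_IDS = frozenset(
    s.lower()
    for s in ("MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC", "Unlicense")
)


@dataclass(frozen=True)
class RepoRecord:
    """Hypothetical record for one candidate repository."""
    name: str
    owner: str
    spdx_id: Optional[str]  # None when no license was detected


def is_permissive(spdx_id: Optional[str]) -> bool:
    """Return True only if the identifier is on the allowlist.

    Matching is case-insensitive; a missing license is rejected.
    """
    if spdx_id is None:
        return False
    return spdx_id.strip().lower() in PERMISSIVE_SPDX_IDS


def filter_corpus(records: Iterable[RepoRecord], opted_out_owners: Set[str]) -> List[RepoRecord]:
    """Keep repositories that are permissively licensed and whose owner
    has not opted out."""
    return [
        r for r in records
        if is_permissive(r.spdx_id) and r.owner not in opted_out_owners
    ]


if __name__ == "__main__":
    sample = [
        RepoRecord("alice/tool", "alice", "MIT"),
        RepoRecord("bob/lib", "bob", "GPL-3.0-only"),
        RepoRecord("carol/app", "carol", "Apache-2.0"),
        RepoRecord("dave/misc", "dave", None),
    ]
    kept = filter_corpus(sample, opted_out_owners={"carol"})
    print([r.name for r in kept])  # prints ['alice/tool']
```

Rejecting repositories with no detected license is a deliberate choice in this sketch: an allowlist fails closed, whereas a blocklist of copyleft licenses would let unlicensed or unrecognized code through.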