Capability
Privacy Compliant Dataset Generation
16 artifacts provide this capability.
Top Matches
via “pii removal and privacy-preserving code filtering”
250GB curated code dataset for StarCoder training.
Unique: Applies PII removal at dataset curation time, before public release, rather than relying on downstream model guardrails. This reduces the risk that a model memorizes sensitive data during training. The scope covers GitHub issues and commits as well as code, and these often contain more PII than source files.
vs others: More comprehensive than CodeSearchNet (which doesn't explicitly address PII) and more proactive than model-level filtering, which reduces legal and compliance risk for organizations using the dataset.
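
To make the idea of curation-time PII removal concrete, here is a minimal sketch in Python. It is not the dataset's actual pipeline, which this listing does not describe. It substitutes a simple regex pass as a stand-in, and the function names (`redact_pii`, `curate`), the record fields (`kind`, `content`, `pii_hits`), and the placeholder tokens are all hypothetical.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# Illustrative patterns only (assumption): a real pipeline needs far broader
# coverage (names, keys, passwords, usernames) and fewer false positives.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)


def redact_pii(text: str) -> Tuple[str, int]:
    """Replace e-mail and IPv4 addresses with placeholder tokens.

    Returns the redacted text and the number of replacements made.
    """
    text, n_email = EMAIL_RE.subn("<EMAIL>", text)
    text, n_ip = IPV4_RE.subn("<IP_ADDRESS>", text)
    return text, n_email + n_ip


def curate(records: Iterable[Dict]) -> Iterator[Dict]:
    """Redact each record before it is written to the released dataset."""
    for record in records:
        cleaned, hits = redact_pii(record["content"])
        # Keep the original fields, swap in the cleaned text, record the hit count.
        yield {**record, "content": cleaned, "pii_hits": hits}


if __name__ == "__main__":
    # Hypothetical record shapes covering code, issues, and commits.
    sample = [
        {"kind": "commit", "content": "Fix bug, thanks jane.doe@example.com (host 192.168.0.12)"},
        {"kind": "code", "content": "def add(a, b):\n    return a + b"},
    ]
    for row in curate(sample):
        print(row["kind"], row["pii_hits"], repr(row["content"]))
```

Because the redaction runs before release, the same pass applies uniformly to code, issues, and commit messages, and the per-record hit count gives a rough signal of which sources carry the most PII. Note that the IPv4 pattern would also match four-part version strings, one reason a regex pass alone is usually not enough.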