Capability
Structured Data Extraction From PDFs
20 artifacts provide this capability.
Top Matches
via “table extraction and normalization to structured formats”
A library that prepares raw documents for downstream ML tasks.
Unique: Uses format-specific table detection (pdfplumber's table grid analysis for PDFs, lxml's table parsing for HTML) combined with a unified normalization layer that handles merged cells and multi-row headers
vs others: Handles complex table layouts (merged cells, multi-row headers) better than simple regex-based extraction, and provides unified output across PDF, HTML, and DOCX formats
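The normalization idea described above can be illustrated with a minimal sketch. This is not the library's actual implementation: the function name `normalize_table`, its parameters, and the rule that a merged cell shows up as `None` or an empty string in the raw grid are all assumptions made for illustration.

```python
from typing import Optional

Row = list[Optional[str]]


def _clean(cell: Optional[str]) -> str:
    """Turn None into "" and collapse internal whitespace and line breaks."""
    return " ".join((cell or "").split())


def normalize_table(
    rows: list[Row], header_rows: int = 1, fill_down: bool = False
) -> list[dict[str, str]]:
    """Flatten a raw table grid into one record (dict) per body row.

    Hypothetical helper: assumes merged cells arrive as None or "".
    """
    if header_rows < 1 or len(rows) < header_rows:
        return []

    # Pad ragged rows so every row has the same number of columns.
    width = max(len(r) for r in rows)
    grid = [[_clean(c) for c in r] + [""] * (width - len(r)) for r in rows]

    # A header cell merged across columns leaves blanks to its right.
    # Copy the label rightwards in every header row except the last one.
    header = [list(r) for r in grid[:header_rows]]
    for row in header[:-1]:
        for i in range(1, width):
            if not row[i]:
                row[i] = row[i - 1]

    # Join the stacked header rows into one name per column.
    # Note: duplicate names would collide in the output dicts.
    names = []
    for i in range(width):
        parts = [row[i] for row in header if row[i]]
        names.append(" ".join(parts) or f"column_{i + 1}")

    # Optionally copy values downwards for vertically merged body cells.
    records = []
    previous = [""] * width
    for row in grid[header_rows:]:
        values = [
            cell or (previous[i] if fill_down else "")
            for i, cell in enumerate(row)
        ]
        previous = values
        records.append(dict(zip(names, values)))
    return records
```

With pdfplumber, the raw grid could come from `page.extract_tables()`, which returns each table as a list of rows of cell strings (or `None`). Filling downwards is left optional because an empty body cell may be genuinely empty and not part of a merged cell.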