multi-format document ingestion and parsing
Automatically loads and parses documents from diverse sources (PDFs, Word docs, HTML, Markdown, code files, databases) into a unified in-memory representation using format-specific loaders and node-based document abstractions. Each source is parsed into one or more Document objects containing content, metadata, and relationships, so downstream processing needs no format-specific handling in application code.
Unique: Provides a unified loader abstraction (BaseReader interface) that normalizes 100+ data source connectors into a single Document/Node API, eliminating format-specific branching logic in application code. Loaders are composable and chainable, allowing sequential transformations (e.g., load → split → extract metadata → embed).
vs alternatives: Broader out-of-the-box loader coverage than LangChain's document loaders and more structured node-based decomposition than raw text splitting, reducing boilerplate for multi-source RAG pipelines.
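The loader abstraction described above can be sketched in a few lines. This is a hypothetical minimal version, not the actual library API: the `BaseReader` protocol, `Document` dataclass, and the two example readers are illustrative stand-ins for the real connector interface, showing how application code stays free of format-specific branching.

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class Document:
    """Unified output type: content plus metadata, regardless of source format."""
    text: str
    metadata: dict = field(default_factory=dict)

class BaseReader(Protocol):
    """Every loader implements the same interface (hypothetical sketch)."""
    def load_data(self, source: str) -> list[Document]: ...

class MarkdownReader:
    def load_data(self, source: str) -> list[Document]:
        # Strip heading markers from the body; record headings as metadata.
        lines = source.splitlines()
        headings = [ln.lstrip("# ") for ln in lines if ln.startswith("#")]
        body = "\n".join(ln for ln in lines if not ln.startswith("#"))
        return [Document(text=body.strip(),
                         metadata={"format": "markdown", "headings": headings})]

class PlainTextReader:
    def load_data(self, source: str) -> list[Document]:
        return [Document(text=source.strip(), metadata={"format": "text"})]

def ingest(source: str, reader: BaseReader) -> list[Document]:
    # Application code sees only Documents, never the source format.
    return reader.load_data(source)

docs = ingest("# Title\nBody text here.", MarkdownReader())
```

Because every reader returns the same `Document` type, downstream stages (splitting, metadata extraction, embedding) can be chained without caring which loader produced the input.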
intelligent document chunking and node splitting
Splits documents into semantically coherent chunks using multiple strategies (character-based, token-aware, recursive, semantic) with configurable overlap and chunk size. Preserves document hierarchy and metadata through a node tree structure, so retrieval systems can maintain context relationships and support hierarchical re-ranking or parent-document retrieval patterns.
Unique: Implements a node-tree abstraction that preserves document hierarchy and enables parent-document retrieval patterns. Supports multiple splitting strategies (recursive, semantic, code-aware) with pluggable custom splitters, and automatically propagates metadata through the node tree.
vs alternatives: More sophisticated than LangChain's text splitters because it preserves hierarchical relationships and supports semantic splitting; better for complex document structures than simple character-based splitting.
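A minimal sketch of the node-tree idea, under assumptions of my own: the `Node` dataclass, `parent_id` linkage, and the fixed-size overlap splitter below are illustrative, not the library's actual implementation. The point is that each child chunk inherits the parent's metadata and keeps a pointer back to it, which is what makes parent-document retrieval possible.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    text: str
    metadata: dict = field(default_factory=dict)
    parent_id: Optional[str] = None
    node_id: str = field(default_factory=lambda: uuid.uuid4().hex)

def split_document(text: str, metadata: dict,
                   chunk_size: int = 100, overlap: int = 20):
    """Split text into overlapping chunks; return (parent node, child nodes)."""
    parent = Node(text=text, metadata=dict(metadata))
    children = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if not chunk:
            break
        children.append(Node(text=chunk,
                             metadata=dict(metadata),   # propagated from parent
                             parent_id=parent.node_id))
    return parent, children

parent, children = split_document("a" * 250, {"source": "doc.txt"},
                                  chunk_size=100, overlap=20)
```

At query time, a retriever can match against the small, focused child chunks, then follow `parent_id` to hand the full parent document to the generator for richer context.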
multi-modal document understanding
Processes documents containing mixed content (text, images, tables, code) by extracting and understanding each modality separately, then synthesizing information across modalities. Uses vision models for image understanding, specialized parsers for tables and code, and integrates results into a unified document representation for retrieval and generation.
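The per-modality processing described above amounts to a dispatch-and-merge pattern, sketched here under stated assumptions: the `Block` type, the handler names, and the placeholder image handler are all hypothetical. A real pipeline would call a vision model for images and a structured parser for tables; this sketch only shows how separately processed modalities are synthesized into one representation.

```python
from dataclasses import dataclass

@dataclass
class Block:
    modality: str   # one of "text", "table", "code", "image"
    payload: object

def handle_text(payload):
    return str(payload)

def handle_code(payload):
    return f"[code]\n{payload}"

def handle_table(payload):
    # payload assumed to be a list of rows; render as pipe-separated lines.
    return "\n".join(" | ".join(map(str, row)) for row in payload)

def handle_image(payload):
    # Stand-in for a vision-model caption of the image.
    return f"[image: {payload}]"

HANDLERS = {"text": handle_text, "code": handle_code,
            "table": handle_table, "image": handle_image}

def synthesize(blocks: list[Block]) -> str:
    """Route each block to its modality handler, then merge in document order."""
    return "\n\n".join(HANDLERS[b.modality](b.payload) for b in blocks)

unified = synthesize([
    Block("text", "Quarterly summary."),
    Block("table", [["region", "revenue"], ["EMEA", 120]]),
    Block("image", "bar chart of revenue by region"),
])
```

Keeping the merged output in document order lets the retrieval layer treat a mixed-modality document exactly like a plain-text one.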