Tree Sitter Based Code Parsing And Semantic Chunking

1

repomix · CLI Tool · 55/100

via “tree-sitter-based code compression and comment stripping”

📦 Repomix is a powerful tool that packs your entire repository into a single, AI-friendly file. Perfect for when you need to feed your codebase to Large Language Models (LLMs) or other AI tools like Claude, ChatGPT, DeepSeek, Perplexity, Gemini, Gemma, Llama, Grok, and more.

Unique: Uses Tree-sitter AST parsing for language-aware comment removal instead of regex patterns, enabling structural understanding of code syntax. Supports 40+ languages natively with automatic fallback to regex-based stripping for unsupported languages, providing consistent compression across heterogeneous codebases.

vs others: More accurate than regex-based comment stripping because it understands language syntax and can distinguish between comments and string literals containing comment-like text. Reduces token consumption by 20-40% compared to naive concatenation while preserving code semantics.
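To illustrate why token-aware stripping is safer than a regex, here is a minimal sketch. It is not Repomix's implementation and does not use Tree-sitter: as a stand-in, it uses Python's standard-library `tokenize` module, which likewise knows the difference between a comment and a string literal that merely contains a `#`. The function name `strip_comments` is hypothetical.

```python
import io
import tokenize


def strip_comments(source: str) -> str:
    """Remove '#' comments from Python source using token positions.

    Assumption: stdlib `tokenize` stands in for a Tree-sitter grammar here.
    A '#' inside a string literal is part of a STRING token, so it survives.
    """
    lines = source.split("\n")
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            row, col = tok.start  # rows are 1-based, columns 0-based
            # A comment always runs to the end of its line, so cut there.
            lines[row - 1] = lines[row - 1][:col].rstrip()
    return "\n".join(lines)


sample = 'url = "http://x/#frag"  # trailing note\n# full line\nprint(url)\n'
print(strip_comments(sample))
```

A regex such as `#.*$` would have truncated the URL at `#frag`; the token-based version leaves the string untouched and removes only the two real comments.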

2

codebase-memory-mcp · MCP Server · 51/100

via “multi-language ast parsing and entity extraction with tree-sitter”

High-performance code intelligence MCP server. Indexes codebases into a persistent knowledge graph — average repo in milliseconds. 66 languages, sub-ms queries, 99% fewer tokens. Single static binary, zero dependencies.

Unique: Uses vendored tree-sitter C bindings compiled into a single static binary, enabling 66-language support without external dependencies or grammar downloads. Integrates incremental parsing to avoid re-parsing unchanged regions during content-hash-based reindexing, achieving ~4× faster incremental updates than full-scan approaches.

vs others: Supports 66 languages in a single binary with zero external dependencies, whereas LSP-based approaches require per-language server installations and regex-based tools are limited to 5-10 languages with poor structural accuracy.

3

claude-context · MCP Server · 50/100

via “syntax-aware code chunking with multi-language ast parsing”

Code search MCP for Claude Code. Make entire codebase the context for any coding agent.

Unique: Uses tree-sitter AST parsing to identify semantic boundaries (functions, classes, modules) for chunking instead of fixed-size windows, with language-specific strategies for 40+ languages. Implements LangChain fallback for unsupported languages, ensuring graceful degradation while maintaining chunk quality.

vs others: More precise than fixed-window chunking (e.g., 512-token windows) because it respects syntactic boundaries; more language-agnostic than language-specific parsers because tree-sitter supports 40+ languages with a single abstraction.
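The difference between boundary-aware and fixed-window chunking can be sketched as follows. This is an illustration only, not claude-context's code: it uses Python's built-in `ast` module in place of Tree-sitter and handles only top-level Python definitions. The name `chunk_by_definition` and the chunk dictionary keys are assumptions made for the example.

```python
import ast


def chunk_by_definition(source: str) -> list[dict]:
    """Return one chunk per top-level function or class, with its line span.

    Assumption: stdlib `ast` stands in for a Tree-sitter parse tree.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start, end = node.lineno, node.end_lineno  # 1-based, inclusive
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": start,
                "end_line": end,
                "text": "\n".join(lines[start - 1:end]),
            })
    return chunks


sample = "import os\n\ndef a():\n    return 1\n\nclass B:\n    def m(self):\n        return 2\n"
for chunk in chunk_by_definition(sample):
    print(chunk["kind"], chunk["name"], chunk["start_line"], chunk["end_line"])
```

Each chunk is a complete definition, so a method is never separated from its class header the way a fixed 512-token window might split it.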

4

CodeGraphContext · MCP Server · 50/100

via “multi-language code parsing with tree-sitter ast extraction”

An MCP server plus a CLI tool that indexes local code into a graph database to provide context to AI assistants.

Unique: Uses Tree-sitter's incremental parsing with language-specific grammars for 14 languages, enabling structural awareness of code relationships rather than text-based pattern matching. Normalizes heterogeneous syntax into a unified graph schema through a language-agnostic entity extraction layer.

vs others: Faster and more accurate than regex-based indexing (Sourcegraph, Ctags) because it understands code structure; broader language support than LSP-only solutions while remaining lightweight and offline-capable.
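As a rough sketch of what a unified graph schema might look like, the example below turns parsed code into language-neutral edges. It is not CodeGraphContext's schema: the `Edge` type, the `DEFINES` and `CALLS` relation names, and `extract_edges` are all hypothetical, and Python's `ast` module stands in for Tree-sitter.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical language-neutral edge: (source) -[relation]-> (target)."""
    source: str
    relation: str
    target: str


def extract_edges(module_name: str, source: str) -> list[Edge]:
    """Emit DEFINES edges for functions and CALLS edges for direct name calls.

    Simplification: calls inside a nested function are attributed to both
    the nested function and its enclosing function.
    """
    edges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            edges.append(Edge(module_name, "DEFINES", node.name))
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    edges.append(Edge(node.name, "CALLS", inner.func.id))
    return edges


sample = "def helper():\n    return 1\n\ndef main():\n    return helper()\n"
for edge in extract_edges("app", sample):
    print(edge)
```

A per-language front end would only need to produce these edges; everything downstream (storage, queries) can then ignore which grammar produced them.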

5

ai-engineering-hub · MCP Server · 48/100

via “code-aware rag with syntax-tree-based chunking”

In-depth tutorials on LLMs, RAGs and real-world AI agent applications.

Unique: Uses tree-sitter AST parsing to preserve code structure during chunking, enabling retrieval that understands function/class boundaries and import relationships, rather than naive text-based chunking that splits code arbitrarily.

vs others: More accurate code retrieval than text-only RAG because structural awareness prevents splitting related code and maintains semantic coherence; outperforms regex-based code search because it works from the parsed syntax tree rather than surface text patterns.

6

code-index-mcp · MCP Server · 46/100

via “tree-sitter ast parsing with language-specific symbol extraction”

A Model Context Protocol (MCP) server that helps large language models index, search, and analyze code repositories with minimal setup.

Unique: Uses tree-sitter for structural parsing across 50+ languages with intelligent fallback to regex heuristics for unsupported languages. Caches parsed results in SQLite, enabling fast symbol lookups without re-parsing on every query.

vs others: More accurate than regex-only parsing because tree-sitter understands syntax trees; more practical than language-specific compilers because it requires no build tools or dependencies beyond Python bindings.
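The idea of caching parsed results so that queries never re-parse can be sketched with the standard-library `sqlite3` module. This is an assumption-laden illustration, not code-index-mcp's schema: the `symbols` table, its columns, and the helper names are hypothetical, and Python's `ast` module stands in for Tree-sitter.

```python
import ast
import sqlite3


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a symbol cache. Table layout is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS symbols "
        "(file TEXT, name TEXT, kind TEXT, line INTEGER)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_symbols_name ON symbols (name)")
    return conn


def index_file(conn: sqlite3.Connection, file: str, source: str) -> int:
    """Parse once and store every function/class definition for `file`."""
    conn.execute("DELETE FROM symbols WHERE file = ?", (file,))
    rows = [
        (file, node.name, type(node).__name__, node.lineno)
        for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    conn.executemany("INSERT INTO symbols VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def lookup(conn: sqlite3.Connection, name: str) -> list[tuple]:
    """Answer a symbol query from the cache alone; no parsing happens here."""
    return conn.execute(
        "SELECT file, kind, line FROM symbols WHERE name = ? ORDER BY file, line",
        (name,),
    ).fetchall()


conn = open_cache()
index_file(conn, "m.py", "class A:\n    pass\n\ndef f():\n    pass\n")
print(lookup(conn, "f"))
```

Parsing cost is paid once per file at index time; each later `lookup` is a single indexed SQL query.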

7

code-review-graph · Product · 41/100

via “tree-sitter-based incremental codebase parsing with sha-256 change tracking”

Local knowledge graph for Claude Code. Builds a persistent map of your codebase so Claude reads only what matters — 6.8× fewer tokens on reviews and up to 49× on daily coding tasks.

Unique: Uses Tree-sitter AST parsing with SHA-256 incremental tracking instead of regex or line-based analysis, enabling structural awareness across 40+ languages while avoiding redundant re-parsing of unchanged files. The incremental update system (diagram 4) tracks file hashes to determine which entities need re-extraction, reducing indexing time from O(n) to O(delta) for large codebases.

vs others: Faster and more accurate than LSP-based indexing for offline analysis because it maintains a persistent graph that survives session boundaries and doesn't require a running language server per language.
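Hash-based change tracking of the kind described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code-review-graph's implementation: `file_digest` and `plan_reindex` are hypothetical names, and file contents are passed in as bytes rather than read from disk.

```python
import hashlib


def file_digest(content: bytes) -> str:
    """SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(content).hexdigest()


def plan_reindex(previous: dict[str, str], current_files: dict[str, bytes]):
    """Compare stored digests with the current files.

    Returns (changed, removed, new_digests):
      changed     - paths that are new or whose digest differs (re-extract these)
      removed     - paths present before but gone now (delete their entities)
      new_digests - digest map to persist for the next run
    """
    new_digests = {path: file_digest(data) for path, data in current_files.items()}
    changed = sorted(p for p, d in new_digests.items() if previous.get(p) != d)
    removed = sorted(p for p in previous if p not in new_digests)
    return changed, removed, new_digests


first = {"a.py": b"x = 1\n", "b.py": b"y = 2\n"}
changed, removed, digests = plan_reindex({}, first)
print(changed, removed)

second = {"a.py": b"x = 1\n", "c.py": b"z = 3\n"}
changed, removed, digests = plan_reindex(digests, second)
print(changed, removed)
```

Note that in this sketch every file is still read and hashed on each run; the saving is in parsing and entity extraction, which are limited to the changed paths.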

8

LEANN · Model · 37/100

via “ast-aware code chunking for semantic code indexing”

[MLsys2026]: RAG on Everything with LEANN. Enjoy 97% storage savings while running a fast, accurate, and 100% private RAG application on your personal device.

Unique: Uses tree-sitter AST parsing to chunk code at semantic boundaries (functions, classes, methods) rather than naive line or token splitting, preserving code structure and improving retrieval quality for code-specific RAG. Most RAG frameworks use generic text chunking that ignores code semantics.

vs others: Produces higher-quality code search results than LangChain's RecursiveCharacterTextSplitter because it respects code structure, enabling retrieval of complete, semantically meaningful code units.

9

llama-index · Framework · 34/100

via “intelligent document chunking with semantic-aware node parsing”

Interface between LLMs and your data

Unique: Offers pluggable NodeParser strategies, including semantic-aware splitting that respects document boundaries and language-specific parsing for code/markdown, with automatic metadata propagation through the node hierarchy.

vs others: More sophisticated than LangChain's text splitters by preserving document hierarchy and offering semantic-aware chunking; supports language-specific parsing without external dependencies.

10

llama-index-core · Framework · 34/100

via “hierarchical document chunking with semantic awareness”

Interface between LLMs and your data

Unique: Implements multiple chunking strategies (simple, recursive, semantic, hierarchical) with automatic parent-child relationship tracking, enabling retrieval systems to fetch full context by traversing node relationships. SemanticSplitter uses embedding-based boundary detection rather than token counting.

vs others: More sophisticated than LangChain's text splitters by preserving document hierarchy and supporting semantic boundaries; enables context-aware retrieval that recovers full sections rather than isolated chunks.
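Parent-child tracking between chunks can be illustrated with a small stand-alone sketch. This does not use or reproduce the LlamaIndex API: the `Node` dataclass, `build_hierarchy`, and `expand_to_parent` are hypothetical names, and the splitting rule (blank-line sections, one child per line) is chosen only to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical chunk node carrying links to its parent and children."""
    node_id: str
    text: str
    parent_id: Optional[str] = None
    child_ids: list = field(default_factory=list)


def build_hierarchy(text: str) -> dict:
    """Sections (split on blank lines) become parents; each line is a child."""
    nodes = {}
    sections = [s.strip() for s in text.split("\n\n") if s.strip()]
    for s_idx, section in enumerate(sections):
        parent = Node(f"sec{s_idx}", section)
        nodes[parent.node_id] = parent
        for c_idx, line in enumerate(section.splitlines()):
            child = Node(f"sec{s_idx}.{c_idx}", line, parent_id=parent.node_id)
            nodes[child.node_id] = child
            parent.child_ids.append(child.node_id)
    return nodes


def expand_to_parent(nodes: dict, node_id: str) -> str:
    """Given a matched child, return the full text of its parent section."""
    node = nodes[node_id]
    return nodes[node.parent_id].text if node.parent_id else node.text


nodes = build_hierarchy("Intro line one\nIntro line two\n\nBody line")
print(expand_to_parent(nodes, "sec0.1"))
```

Retrieval can match on the small child chunk for precision, then follow `parent_id` to hand the model the whole section.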

11

DocMason – Agent Knowledge Base for local complex office files · Repository · 34/100

via “chunking and semantic segmentation of document content”

I think everyone has already read Karpathy's post about LLM knowledge bases. For the past few weeks I have been working on an agent-native knowledge base for complex research (DocMason). It runs entirely in Codex/Claude Code. I call this paradigm "the repo is the app". Codex is

Unique: Uses structure-aware chunking that respects document hierarchy (sections, tables, lists) and creates overlapping chunks with full provenance metadata, rather than naive token-count splitting that destroys semantic boundaries.

vs others: More sophisticated than LangChain's RecursiveCharacterTextSplitter because it understands document structure semantics and preserves table/section integrity, while simpler than enterprise solutions like Unstructured.io that require additional dependencies.
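Overlapping chunks with provenance can be sketched as a sliding window over already-separated structural units (paragraphs, table rows, list items). This is an assumption-based illustration, not DocMason's code: `overlapping_chunks` and the metadata keys are hypothetical.

```python
def overlapping_chunks(units: list, size: int, overlap: int, source: str) -> list:
    """Slide a window of `size` units, stepping by `size - overlap`.

    Each chunk records where it came from (hypothetical provenance keys),
    so an answer can be traced back to exact units in the source file.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(units), step):
        window = units[start:start + size]
        chunks.append({
            "source": source,
            "first_unit": start,
            "last_unit": start + len(window) - 1,
            "text": "\n".join(window),
        })
        if start + size >= len(units):
            break  # the last window already reached the end of the document
    return chunks


paragraphs = ["p0", "p1", "p2", "p3", "p4"]
for chunk in overlapping_chunks(paragraphs, size=3, overlap=1, source="report.docx"):
    print(chunk["first_unit"], chunk["last_unit"])
```

Because the window moves over whole units rather than tokens, a table row or list item is never cut in half, and the shared unit between neighbours carries context across the boundary.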

12

Repo Map · MCP Server · 33/100

via “tree-sitter-based code definition extraction with language-specific query files”

🐧 🪟 🍎 An MCP server (and command-line tool) that provides a dynamic map of chat-related files from the repository, with their function prototypes and related files in order of relevance. Based on the "Repo Map" functionality in Aider.chat.

Unique: Uses Tree-sitter AST parsing with language-specific query files (get_tags_raw method in repomap_class.py) instead of regex or heuristic-based extraction, enabling structurally-aware definition and reference extraction across 40+ languages with consistent semantics. The Tag namedtuple structure preserves full context (relative filename, absolute filename, line number, entity name, entity kind) for downstream processing.

vs others: More accurate than regex-based code extraction and faster than LSP-based approaches because it parses locally without network overhead; more portable than language-specific parsers because Tree-sitter provides unified interface across languages.
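The five-field tag record described above can be sketched as follows. The field names (`rel_fname`, `fname`, `line`, `name`, `kind`) are guesses based on the prose, not copied from `repomap_class.py`, and Python's `ast` module stands in for Tree-sitter query files. `extract_tags` is a hypothetical helper.

```python
import ast
import os
from collections import namedtuple

# Assumed field names: relative filename, absolute filename, line, name, kind.
Tag = namedtuple("Tag", ["rel_fname", "fname", "line", "name", "kind"])


def extract_tags(root: str, rel_fname: str, source: str) -> list:
    """Collect 'def' tags for definitions and 'ref' tags for direct calls."""
    fname = os.path.join(root, rel_fname)
    tags = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            tags.append(Tag(rel_fname, fname, node.lineno, node.name, "def"))
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            tags.append(Tag(rel_fname, fname, node.lineno, node.func.id, "ref"))
    return sorted(tags, key=lambda t: (t.line, t.kind))


sample = "def greet():\n    return 'hi'\n\ngreet()\n"
for tag in extract_tags("/repo", "app.py", sample):
    print(tag.kind, tag.name, tag.line)
```

Keeping definitions and references in the same record shape makes it straightforward to link a `ref` in one file to the `def` of the same name in another when ranking related files.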

13

Sourcerer · MCP Server · 29/100

via “tree-sitter based code parsing and semantic chunking”

MCP for semantic code search & navigation that reduces token waste.

Unique: Uses Tree-sitter AST parsing instead of regex or simple text splitting, enabling structurally-aware chunking that respects language syntax boundaries and extracts semantic units (functions, classes) with full context preservation.

vs others: More accurate than line-based or regex-based chunking because it understands actual code structure; more maintainable than custom parsers because Tree-sitter grammars are community-maintained and battle-tested.

14

Scaffold · Repository · 27/100

via “multi-language source code parsing with ast extraction”

Scaffold is a Retrieval-Augmented Generation (RAG) system designed for structural understanding of large codebases. It transforms your source code into a living knowledge graph, allowing for precise, context-aware interactions that go far beyond simple file retrieval.

Unique: Uses tree-sitter-based language-agnostic parsing with fallback strategies for unsupported languages, enabling consistent AST extraction across 15+ languages without custom parser implementation per language. Caches parsed ASTs in memory to avoid re-parsing during incremental updates.

vs others: More accurate than regex-based code analysis and faster than full semantic analysis tools like Roslyn or LLVM, while supporting more languages than language-specific solutions like Jedi (Python-only).

15

Continue · Extension

via “language-agnostic code understanding via tree-sitter ast parsing”
