grammar-constrained text generation with token healing
Generates text from LLMs while enforcing constraints defined as an AST of GrammarNode subclasses (LiteralNode, RegexNode, SelectNode, JsonNode). Uses a token healing mechanism that operates at the text level rather than the token level, correctly handling constraint boundaries that fall mid-token and preventing invalid token sequences at constraint edges. The TokenParser and ByteParser engines integrate constraints directly into the generation loop, ensuring every token respects the grammar before it is emitted.
Unique: Implements token healing at the text level (not the token level) with an immutable GrammarNode AST architecture, allowing constraints to be composed and reused across programs while maintaining correct behavior at token boundaries. The TokenParser/ByteParser dual-engine design handles both token-level and byte-level constraints without requiring external validation passes.
vs alternatives: More efficient than post-generation validation (no retry loops) and more flexible than simple prompt engineering, because constraints are enforced during generation rather than after, reducing wasted tokens and guaranteeing format compliance on the first attempt.
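As a rough illustration, here is a minimal sketch of how these constraints look from guidance's public select/gen API; the model name is a placeholder, and the mapping of select to SelectNode and regex-constrained gen to RegexNode follows the description above:

```python
# Sketch: constrained generation via guidance's public API.
# "gpt2" is a placeholder; any guidance backend works.
from guidance import models, gen, select

lm = models.Transformers("gpt2")

# select() constrains output to one of the listed strings; token healing
# handles option boundaries that fall mid-token.
lm += "Sentiment: " + select(["positive", "negative", "neutral"], name="label")

# A regex-constrained gen(): every sampled token must keep the output a
# valid prefix of the pattern.
lm += "\nConfidence: " + gen(name="score", regex=r"0\.\d{2}")

print(lm["label"], lm["score"])
```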
stateful execution with interleaved control flow and generation
Maintains model state through immutable lm objects that accumulate generated text, captured variables, and execution context across multiple generation steps: each step produces a new lm object rather than mutating the old one. The @guidance decorator transforms Python functions into programs that interleave traditional control flow (conditionals, loops, function calls) with constrained text generation, executing them in a unified stateful context. Each step's resulting lm state carries forward to subsequent steps, enabling dynamic decision-making based on previous generations.
Unique: Uses immutable lm state objects that accumulate text and captures across decorated function boundaries, enabling Python control flow (if/else, for loops, function calls) to be seamlessly interleaved with generation. The @guidance decorator acts as a compiler that transforms Python functions into stateful generation programs without requiring explicit state threading.
vs alternatives: More expressive than simple prompt templates because it allows arbitrary Python logic to drive generation decisions, and more maintainable than hand-rolled state management because the decorator handles state threading automatically across function boundaries.
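A sketch of the decorator pattern, assuming the standard @guidance calling convention (the ticket text and the branching logic are illustrative):

```python
import guidance
from guidance import gen, select

@guidance
def triage(lm, ticket):
    # Each += returns a new immutable lm carrying text and captures forward.
    lm += f"Ticket: {ticket}\nSeverity: " + select(["low", "high"], name="sev")
    if lm["sev"] == "high":
        # Ordinary Python control flow decides whether to generate more.
        lm += "\nEscalation summary: " + gen("summary", stop="\n")
    return lm

# Usage: compose the decorated program onto any model object with +.
# lm = models.Transformers("gpt2") + triage(ticket="Server is down")
```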
ebnf grammar definition and composition
Allows developers to define reusable grammar rules using Extended Backus-Naur Form (EBNF) syntax, which are compiled into GrammarNode ASTs. Rules can reference other rules, enabling composition of complex grammars from simpler components. The EBNF parser (guidance/library/_ebnf.py) converts textual grammar definitions into executable constraints. Rules are stored in a grammar registry and can be reused across multiple Guidance programs.
Unique: Provides EBNF syntax for defining grammars that are compiled into GrammarNode ASTs, enabling developers to express complex constraints using a standard formal notation. Rules are composable and reusable across programs via a grammar registry.
vs alternatives: More expressive and maintainable than nested Python grammar objects because EBNF is a standard notation, and more flexible than hardcoded format strings because rules can be parameterized and composed.
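The textual EBNF front end is not reproduced here; as a sketch of the composition idea, guidance's stateless grammar functions build the same GrammarNode AST and can reference one another much like EBNF rules. The zero_or_more combinator and the number/list rules below are illustrative assumptions, not the EBNF loader itself:

```python
# Sketch: rule composition with stateless grammar functions, which
# compile to the same GrammarNode AST the EBNF front end targets.
import guidance
from guidance import gen, zero_or_more

@guidance(stateless=True)
def number(lm):
    # EBNF: number = digit , { digit } ;
    return lm + gen(regex=r"[0-9]+")

@guidance(stateless=True)
def number_list(lm):
    # EBNF: list = "[" , number , { "," , number } , "]" ;
    return lm + "[" + number() + zero_or_more("," + number()) + "]"

# Usage: lm = models.Transformers("gpt2") + "IDs: " + number_list()
```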
token-level and byte-level parsing with dual-engine architecture
Implements two parsing engines (TokenParser and ByteParser) that operate at different levels of abstraction. TokenParser works at the token level, validating that generated tokens conform to grammar constraints. ByteParser operates at the byte level, handling sub-token constraints and ensuring correct behavior at character boundaries. The dual-engine design allows constraints to be expressed at the appropriate level of abstraction while maintaining correctness across token boundaries.
Unique: Implements a dual-engine architecture (TokenParser and ByteParser) that operates at both token and byte levels, enabling constraints to be enforced at the appropriate abstraction level while maintaining correctness at boundaries. Token healing is implemented through careful coordination between engines.
vs alternatives: More efficient than purely byte-level parsing because token-level checks validate whole tokens in one step rather than byte by byte, and more correct than purely token-level parsing because byte-level checks handle edge cases where constraints fall mid-token.
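A library-independent illustration of the boundary problem both engines have to handle, using a GPT-2 tokenizer (the URL is arbitrary; exact token IDs depend on the tokenizer):

```python
# Why boundaries matter: the same text tokenizes differently depending
# on where a constraint boundary falls, so naively concatenating token
# sequences at a boundary yields "unhealed" sequences the model rarely
# saw in training.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

whole = tok.encode("http://example.com")                   # natural split
split = tok.encode("http:") + tok.encode("//example.com")  # forced boundary

print(tok.decode(whole) == tok.decode(split))  # True: identical text
print(whole == split)                          # typically False for BPE tokenizers
# Text-level healing re-tokenizes across the boundary so generation
# resumes from the natural sequence, not the forced one.
```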
llama.cpp and transformers local model inference
Provides native integration with local LLM inference engines (llama.cpp via llama-cpp-python, and Hugging Face Transformers). Enables running Guidance programs against locally-hosted models without cloud API dependencies. Supports model quantization, GPU acceleration, and batch processing. The local model backend handles tokenization, context management, and generation scheduling directly within the Python process.
Unique: Provides native integration with llama.cpp (via llama-cpp-python) and Transformers, enabling local inference with full Guidance constraint support. Handles tokenization, context management, and generation scheduling within the Python process without external service dependencies.
vs alternatives: More cost-effective than cloud APIs for high-volume inference and more privacy-preserving because data never leaves the local machine, though with higher infrastructure requirements.
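A sketch of both local backends; the paths and model names are placeholders, and the extra keyword argument is assumed to pass through to the underlying llama.cpp engine:

```python
from guidance import models, gen

# llama.cpp via llama-cpp-python: loads a quantized GGUF file from disk.
llama_lm = models.LlamaCpp("path/to/model.gguf", n_gpu_layers=-1)

# Hugging Face Transformers: any local or hub-hosted causal LM.
hf_lm = models.Transformers("microsoft/Phi-3-mini-4k-instruct")

# The same constrained program runs on either backend, fully in-process.
for lm in (llama_lm, hf_lm):
    out = lm + "2 + 2 = " + gen("ans", regex=r"[0-9]+")
    print(out["ans"])
```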
openai, azure openai, and vertexai remote api integration
Provides unified integration with remote LLM APIs (OpenAI, Azure OpenAI, Google VertexAI) through a common backend interface. Handles API authentication, request formatting, token counting, and response parsing. Supports streaming and non-streaming modes. The remote backend abstracts differences between API protocols while maintaining Guidance's constraint semantics.
Unique: Provides unified backend abstraction for OpenAI, Azure OpenAI, and VertexAI APIs, normalizing differences in authentication, request formatting, and response parsing. Maintains Guidance's constraint semantics across different API protocols.
vs alternatives: More convenient than direct API client usage because Guidance handles constraint enforcement and state management, and more flexible than provider-specific SDKs because the same code works across multiple providers.
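A sketch of the remote backends behind the common interface; the model names and endpoint are placeholders, credentials are normally read from environment variables, and the Azure/VertexAI constructor details are assumptions hedged in the comments. Chat-style APIs use guidance's role context managers:

```python
from guidance import models, gen, user, assistant

# Same constraint semantics, different provider backends.
lm = models.OpenAI("gpt-4o-mini")  # reads OPENAI_API_KEY from the environment
# lm = models.AzureOpenAI(model="...", azure_endpoint="https://...")  # Azure variant
# lm = models.VertexAI("gemini-pro")  # VertexAI variant; name may differ by release

with user():
    lm += "Reply with a one-word greeting."
with assistant():
    lm += gen("greeting", max_tokens=5)

print(lm["greeting"])
```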
capture and variable extraction from constrained generation
Automatically extracts and stores named captures from constrained generation into the lm state object. Supports capturing from regex groups, selected options, JSON fields, and literal text. Captured variables are accessible in subsequent generation steps and control flow branches. The capture mechanism enables dynamic decision-making based on what the model generated in previous steps.
Unique: Automatically extracts named captures from constrained generation (regex groups, JSON fields, selected options) and stores them in the lm state for use in subsequent steps. Enables dynamic workflows where each step uses outputs from previous steps.
vs alternatives: More integrated than post-generation parsing because captures are extracted during generation, and more flexible than hardcoded extraction logic because capture names can be defined in constraints.
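A sketch of named captures feeding later steps (the model and prompt are placeholders):

```python
from guidance import models, gen, select

lm = models.Transformers("gpt2")

# The capture name is declared inside the constraint itself.
lm += "Color: " + select(["red", "green", "blue"], name="color")

# A later step reads the capture from the lm state to drive new text.
lm += f"\nHex code for {lm['color']}: #" + gen("hex", regex=r"[0-9a-f]{6}")

print(lm["color"], lm["hex"])  # captures are plain strings on lm
```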
multi-backend model abstraction with unified api
Provides a unified interface for executing Guidance programs across heterogeneous LLM backends (local: LlamaCpp, Transformers; remote: OpenAI, Azure OpenAI, VertexAI) without changing program code. The model abstraction layer (guidance/models/_base) defines a common interface that each backend implements, handling differences in tokenization, API protocols, and inference engines. Programs written against the abstract model interface automatically work with any backend by swapping the model initialization parameter.
Unique: Implements a backend abstraction layer (guidance/models/_base/_model.py) that normalizes differences between local inference engines (LlamaCpp, Transformers) and remote APIs (OpenAI, Azure, VertexAI) through a common interface, enabling the same Guidance program to execute unchanged across any backend. Uses dependency injection to swap backends at initialization time.
vs alternatives: More flexible than LangChain's model abstraction because it preserves Guidance's constraint semantics across backends, and more comprehensive than raw API clients because it handles tokenization normalization and state management automatically.
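A sketch of the dependency-injection pattern the abstraction enables; the program body, model names, and paths are placeholders:

```python
import guidance
from guidance import models, gen

@guidance
def summarize(lm, text):
    # Written once against the abstract model interface.
    lm += f"Text: {text}\nOne-line summary: " + gen("summary", stop="\n")
    return lm

def run(backend):
    # The concrete backend is injected at initialization time.
    return (backend + summarize(text="Guidance programs are portable."))["summary"]

# Swap backends without touching summarize():
# run(models.Transformers("gpt2"))
# run(models.LlamaCpp("path/to/model.gguf"))
# run(models.OpenAI("gpt-4o-mini"))  # chat models may also need role blocks
```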
+7 more capabilities