What can ai-agents-from-scratch do?

local-llm-inference-via-node-llama-cpp, function-calling-with-tool-schema-binding, hybrid-local-cloud-model-switching, progressive-learning-path-with-modular-examples, persistent-conversation-memory-with-message-history, react-pattern-agent-orchestration, streaming-token-generation-with-async-iteration, system-prompt-specialization-for-task-adaptation, batch-parallel-processing-with-concurrent-inference, model-selection-and-quantization-strategy-guidance, temperature-and-sampling-parameter-control, token-counting-and-context-window-management

ai-agents-from-scratch

AgentFree

Demystify AI agents by building them yourself. Local LLMs, no black boxes, real understanding of function calling, memory, and ReAct patterns.

Open Source

/ 100

12 capabilities

Capabilities12 decomposed

local-llm-inference-via-node-llama-cpp

Medium confidence

Executes quantized GGUF language models locally using node-llama-cpp bindings to the llama.cpp C++ runtime, with platform-specific acceleration (Metal on macOS, CUDA/Vulkan on Linux/Windows). Models run entirely on-device without cloud API calls, enabling privacy-preserving inference with configurable temperature, token limits, and streaming output. The architecture abstracts the underlying C++ runtime through JavaScript bindings, handling model loading, memory management, and token generation.

Solves for

Run LLM inference locally without sending data to cloud providersBuild agents with privacy guarantees and no API rate limitsExperiment with different quantized models without cloud costsUnderstand how LLM inference actually works at the binary level

Best for

developers building privacy-sensitive AI agents

educators teaching LLM fundamentals without cloud dependencies

teams prototyping agents with cost constraints

Requires

Node.js 18+

node-llama-cpp npm package (includes pre-compiled binaries)

8GB+ RAM for 7B models, 16GB+ for 13B+ models

Limitations

Inference speed depends on local hardware; CPU-only inference is 10-50x slower than GPU-accelerated cloud APIs

Memory footprint scales with model size; 7B parameter models require ~8GB RAM minimum

No built-in batching or request queuing — single-threaded inference per model instance

What makes it unique

Uses node-llama-cpp bindings to llama.cpp's optimized C++ runtime rather than pure JavaScript inference, enabling hardware acceleration (Metal/CUDA/Vulkan) and efficient token generation on consumer hardware. The repository explicitly teaches this as the foundation layer, with examples showing model loading, context window management, and streaming token iteration.

vs alternatives

Faster and more memory-efficient than pure JavaScript LLM implementations (e.g., ONNX Runtime), and more transparent than cloud APIs because the entire inference pipeline runs locally with visible code.

function-calling-with-tool-schema-binding

Medium confidence

Implements structured function calling by embedding tool schemas in system prompts and parsing LLM-generated function calls from text output. The architecture defines tools as JavaScript objects with name, description, and parameters, then instructs the LLM to output function calls in a parseable format (typically JSON or XML). A tool execution framework intercepts these outputs, validates them against the schema, and executes the corresponding JavaScript functions, returning results back to the LLM for further reasoning.

Solves for

Enable agents to call external tools (APIs, databases, file systems) based on reasoningUnderstand how function calling works without relying on proprietary APIs like OpenAI's function_callingBuild agents that can dynamically choose which tools to use based on task requirementsImplement ReAct-style agents that reason about tool use before execution

Best for

developers learning agent architecture from first principles

teams building agents with local LLMs that lack native function calling

educators demonstrating tool use patterns without cloud API dependencies

Requires

Node.js 18+

Tool definitions as JavaScript objects with {name, description, parameters, execute} structure

System prompt that instructs LLM to output function calls in a specific format (JSON/XML)

Limitations

Parsing function calls from text is fragile — LLM may generate malformed JSON or hallucinate function names

No built-in retry logic if parsing fails; requires manual error handling and re-prompting

Schema validation is manual; no automatic type coercion or constraint enforcement

What makes it unique

Implements function calling as a text-parsing pattern rather than relying on proprietary APIs, making it transparent and portable across any LLM. The repository includes explicit examples (simple-agent module) showing schema definition, prompt engineering for tool calls, and error handling — teaching the mechanics rather than hiding them in a framework.

vs alternatives

More transparent and educational than OpenAI's function_calling API, and works with any local LLM; less reliable than native function calling because it depends on text parsing, but enables understanding of how function calling actually works.

hybrid-local-cloud-model-switching

Medium confidence

Enables switching between local LLMs (via node-llama-cpp) and cloud APIs (OpenAI, Anthropic) through a unified interface, allowing developers to compare quality/speed tradeoffs or fall back to cloud when local inference is insufficient. The architecture abstracts the model backend behind a common interface, with conditional logic to route requests to either local or cloud providers based on configuration. This pattern allows the same agent code to work with different model sources without modification.

Solves for

Compare local vs cloud LLM quality and performance for specific tasksImplement fallback strategies (use local by default, fall back to cloud on error)Prototype with cloud APIs then migrate to local models for productionUnderstand tradeoffs between privacy (local) and capability (cloud)

Best for

developers evaluating local vs cloud LLM tradeoffs

teams building hybrid systems with fallback capabilities

builders prototyping with cloud APIs then optimizing with local models

Requires

Node.js 18+

node-llama-cpp for local inference

OpenAI SDK or Anthropic SDK for cloud APIs

Limitations

Abstraction adds complexity; unified interface may not expose all model-specific features

Cost tracking becomes complex with multiple providers; requires separate accounting for local vs cloud usage

Quality differences between models may require different prompts or parameters; unified interface may not accommodate all variations

What makes it unique

Demonstrates hybrid architectures through the openai-intro module, showing how to use OpenAI API as an alternative to local inference. The repository explicitly compares local vs cloud approaches, enabling developers to understand when each is appropriate.

vs alternatives

More flexible than pure local or pure cloud approaches, enabling experimentation and fallback; requires more code to manage multiple providers, but enables informed decision-making about deployment strategy.

progressive-learning-path-with-modular-examples

Medium confidence

Structures agent development as a nine-module learning progression, where each module introduces exactly one new concept (basic LLM interaction → function calling → memory → ReAct). The architecture uses consistent module structure (executable .js file, detailed CODE.md walkthrough, conceptual CONCEPT.md explanation) to enable self-paced learning with multiple entry points. Each module builds on previous ones, creating a scaffolded learning experience from fundamentals to autonomous agents.

Solves for

Learn agent development from first principles without skipping foundational conceptsUnderstand how each component (prompts, tools, memory, reasoning) contributes to agent behaviorExperiment with working code examples at each stage of complexityBuild mental models of agent architecture through hands-on implementation

Best for

developers new to AI agents wanting to understand fundamentals

educators teaching agent development and LLM concepts

teams building custom agents and needing architectural understanding

Requires

Node.js 18+

JavaScript/async programming knowledge

Willingness to read and modify code

Limitations

Learning path is linear; skipping modules may create knowledge gaps

Examples are educational, not production-ready; require significant refactoring for real use

No interactive exercises or automated testing; learners must manually verify understanding

What makes it unique

Structures the entire repository as a deliberate learning progression with consistent documentation (CODE.md for implementation details, CONCEPT.md for conceptual understanding), making it explicitly educational rather than just a collection of examples. Each module is self-contained but builds on previous ones.

vs alternatives

More pedagogically structured than most open-source agent projects, with explicit focus on understanding over frameworks; less comprehensive than production frameworks like LangChain, but more transparent and suitable for learning.

persistent-conversation-memory-with-message-history

Medium confidence

Maintains conversation state by storing message history (user and assistant messages) in memory or persistent storage, then including the full or windowed history in each LLM prompt. The architecture uses a message buffer that tracks role (user/assistant), content, and optionally metadata (timestamps, tool calls). Between turns, the system appends new user messages and LLM responses to this buffer, then passes the entire history to the LLM context window, enabling multi-turn reasoning and context awareness.

Solves for

Build agents that remember previous interactions and maintain conversation contextImplement multi-turn reasoning where agents reference earlier steps or decisionsCreate stateful agents that can learn from conversation history within a sessionUnderstand how context management works without external vector databases or RAG

Best for

developers building conversational agents with local LLMs

educators teaching context management and memory patterns

teams prototyping chatbots that need session-level memory

Requires

Node.js 18+

Message history data structure (array of {role, content} objects)

LLM context window large enough for full history + new prompt (typically 2K-4K tokens for multi-turn)

Limitations

No automatic token counting — developers must manually track context window usage or implement token budgeting

Full history approach hits context window limits quickly; requires manual windowing or summarization for long conversations

No built-in persistence — memory is lost when process terminates unless explicitly saved to disk/database

What makes it unique

Implements memory as simple message history appended to each prompt, without vector databases, RAG, or external storage — making it transparent and suitable for educational purposes. The simple-agent-with-memory module explicitly shows how to maintain state across turns and handle context window constraints.

vs alternatives

Simpler and more transparent than RAG-based memory systems, but less scalable for long-term memory; suitable for session-level context but not for persistent knowledge bases across multiple conversations.

react-pattern-agent-orchestration

Medium confidence

Implements the ReAct (Reasoning + Acting) pattern by orchestrating a loop where the LLM reasons about the next step, decides whether to call a tool or return a final answer, executes the tool if needed, and incorporates the result back into the conversation history. The architecture maintains a reasoning trace (visible to the LLM) that shows thought processes, tool calls, and observations, enabling the agent to self-correct and refine its approach iteratively. Each loop iteration appends the LLM's reasoning and tool results to the message history, creating a transparent audit trail.

Solves for

Build autonomous agents that reason about multi-step problems and decide when to use toolsImplement agents that can self-correct by observing tool results and adjusting their approachCreate transparent agent execution with visible reasoning traces for debugging and auditingUnderstand how production agent frameworks (LangChain, AutoGPT) orchestrate reasoning and action

Best for

developers building autonomous agents with local LLMs

educators teaching agent orchestration patterns and reasoning loops

teams prototyping multi-step task automation (research, planning, execution)

Requires

Node.js 18+

Local LLM with sufficient context window (4K+ tokens recommended)

Tool definitions with schema and execution functions

Limitations

No built-in loop termination logic — requires manual max-iterations limit or explicit stop conditions to prevent infinite loops

Tool execution errors are not automatically handled; malformed tool calls or failed executions require explicit error recovery logic

Reasoning traces consume tokens rapidly; long reasoning chains can exhaust context window before reaching a solution

What makes it unique

Implements ReAct as an explicit loop in JavaScript code rather than hiding it in a framework, showing exactly how reasoning, tool selection, and action execution are orchestrated. The react-agent module includes the full loop with error handling, reasoning trace management, and termination logic, making the pattern transparent and modifiable.

vs alternatives

More transparent and educational than LangChain's agent executors because the entire loop is visible and modifiable; less robust than production frameworks because error handling and optimization are manual, but enables deep understanding of agent mechanics.

streaming-token-generation-with-async-iteration

Medium confidence

Streams LLM output tokens in real-time using async iterators, allowing applications to display partial responses as they are generated rather than waiting for the full completion. The architecture uses node-llama-cpp's streaming API to yield tokens as they are produced by the inference engine, enabling progressive rendering, early stopping, and responsive user interfaces. Each token is yielded individually, allowing callers to accumulate them into a full response or process them incrementally.

Solves for

Display LLM responses progressively to users without waiting for full completionImplement responsive chat interfaces that show typing-like behaviorEnable early stopping or user interruption of long-running generationsBuild real-time applications where partial responses are valuable (e.g., code generation, writing assistance)

Best for

developers building interactive chat applications or web interfaces

teams creating responsive user experiences with LLM outputs

builders implementing real-time collaboration features with AI

Requires

Node.js 18+ with async/await and async iterator support

node-llama-cpp with streaming API enabled

Async function to consume the token stream (e.g., for loop with await)

Limitations

Streaming adds complexity to error handling — errors may occur mid-stream, requiring graceful degradation

Token-by-token processing prevents some optimizations (e.g., batch validation, structured output parsing)

Streaming output is harder to validate or constrain to a schema; requires post-processing or specialized parsing

What makes it unique

Exposes node-llama-cpp's streaming API directly through JavaScript async iterators, making token-by-token generation transparent and composable. The coding module demonstrates streaming for code generation, showing how to accumulate tokens and handle partial outputs.

vs alternatives

More efficient than buffering full responses before rendering, and more transparent than cloud APIs that abstract streaming details; requires more manual handling of async patterns but enables fine-grained control over token processing.

system-prompt-specialization-for-task-adaptation

Medium confidence

Adapts LLM behavior by injecting task-specific system prompts that define role, constraints, output format, and reasoning style. The architecture treats system prompts as the primary control mechanism for agent specialization, allowing different prompts to transform the same base model into different specialized agents (translator, reasoner, code generator, etc.). System prompts are prepended to the message history and remain constant across conversation turns, establishing the agent's persona and operational guidelines.

Solves for

Specialize a single LLM model for different tasks without fine-tuningControl agent behavior and output format through prompt engineeringCreate domain-specific agents (translator, mathematician, code reviewer) from a base modelUnderstand how system prompts shape LLM reasoning and output without model changes

Best for

developers building multi-purpose agent systems with task switching

teams experimenting with prompt engineering for behavior control

educators teaching how LLMs respond to instructions and constraints

Requires

Node.js 18+

Well-crafted system prompt text (typically 100-500 tokens)

Understanding of target LLM's capabilities and limitations

Limitations

Prompt engineering is empirical and non-deterministic — same prompt may produce different results across runs or models

System prompts consume tokens; complex prompts reduce available context for user input and conversation history

No guarantee that LLM will follow system prompt constraints; models may ignore instructions or hallucinate

What makes it unique

Treats system prompts as the primary mechanism for agent specialization, with examples (translation, think modules) showing how different prompts transform the same model. The repository emphasizes prompt engineering as a core skill for agent development, with explicit CONCEPT.md documentation for each module's prompt strategy.

vs alternatives

More flexible and transparent than model fine-tuning, and faster to iterate than training custom models; less reliable than fine-tuning for complex behaviors, but enables rapid experimentation and task switching without retraining.

batch-parallel-processing-with-concurrent-inference

Medium confidence

Processes multiple independent requests concurrently using Promise.all() or similar patterns, allowing multiple inference tasks to run in parallel (subject to hardware constraints). The architecture spawns multiple LLM inference tasks simultaneously, each with its own prompt and context, then collects results as they complete. This pattern is useful for embarrassingly parallel workloads (e.g., processing a batch of documents, generating multiple variations) where tasks are independent and can share the same model instance.

Solves for

Process multiple independent LLM requests concurrently without sequential waitingImplement batch processing pipelines for document analysis, content generation, or data transformationMaximize hardware utilization by running multiple inference tasks in parallelBuild scalable agent systems that handle multiple user requests simultaneously

Best for

developers building batch processing pipelines with local LLMs

teams processing large document collections or datasets with AI

builders implementing multi-user agent systems with resource sharing

Requires

Node.js 18+ with Promise and async/await support

Sufficient RAM to load model once and run multiple inference tasks (typically 2-4x base model memory)

Task queue or concurrency limiter to prevent resource exhaustion (e.g., p-limit library)

Limitations

Concurrency is limited by available RAM and CPU; too many parallel tasks cause memory exhaustion or context switching overhead

node-llama-cpp may not support true parallelism on all platforms; some implementations serialize inference despite async code

No built-in load balancing or queue management; developers must manually limit concurrent tasks to avoid resource exhaustion

What makes it unique

Demonstrates concurrent inference using standard JavaScript Promise patterns (Promise.all) rather than specialized frameworks, showing how to parallelize LLM tasks with explicit concurrency control. The batch module includes examples of processing multiple requests and handling results/errors.

vs alternatives

Simpler and more transparent than distributed inference frameworks, but limited by single-machine resources; suitable for batch processing on local hardware, not for large-scale distributed workloads.

model-selection-and-quantization-strategy-guidance

Medium confidence

Provides educational guidance on selecting appropriate quantized GGUF models based on task requirements, hardware constraints, and quality/speed tradeoffs. The architecture documents model characteristics (parameter count, quantization level, context window, inference speed) and helps developers choose between models like Mistral, Llama 2, Phi, and others. The repository includes a model download utility (npx node-llama-cpp pull) that surfaces model options and their specifications, enabling informed selection without trial-and-error.

Solves for

Choose appropriate LLM models for specific tasks and hardware constraintsUnderstand quantization tradeoffs (model size vs quality vs speed)Evaluate different models for agent development without extensive benchmarkingLearn how model selection impacts agent performance and resource usage

Best for

developers new to local LLM deployment making first model selection

teams evaluating models for production agent systems

educators teaching model selection criteria and quantization concepts

Requires

Node.js 18+

npx node-llama-cpp pull command to browse available models

Understanding of model parameters (7B, 13B, 70B) and quantization levels (Q4, Q5, Q8)

Limitations

Guidance is general; optimal model depends on specific task and cannot be determined without testing

Quantization quality varies by model and quantization method; no universal quality metric

Inference speed benchmarks are hardware-dependent; speeds vary significantly across CPU/GPU configurations

What makes it unique

Provides explicit educational guidance on model selection and quantization through DOWNLOAD.md and Model Management documentation, teaching the reasoning behind choices rather than prescribing a single model. The repository includes concrete examples of different models (Mistral, Llama 2, Phi) used across modules.

vs alternatives

More transparent and educational than cloud APIs that abstract model selection, and more practical than academic papers on quantization; lacks automated benchmarking but enables informed decision-making through clear documentation.

temperature-and-sampling-parameter-control

Medium confidence

Exposes temperature, top-p, and other sampling parameters to control LLM output randomness and creativity. The architecture allows developers to tune these parameters per request, enabling different behaviors for different tasks (e.g., low temperature for deterministic code generation, high temperature for creative writing). Parameters are passed to the node-llama-cpp inference engine, which uses them to control the probability distribution over next tokens during generation.

Solves for

Control LLM output randomness and creativity for different task typesGenerate deterministic outputs for code or structured data (low temperature)Generate diverse or creative outputs for brainstorming or content creation (high temperature)Understand how sampling parameters affect LLM behavior and output quality

Best for

developers fine-tuning agent behavior for specific tasks

teams experimenting with output quality and diversity tradeoffs

educators teaching LLM sampling and probability concepts

Requires

Node.js 18+

node-llama-cpp with sampling parameter support

Understanding of temperature, top-p, top-k, and other sampling parameters

Limitations

Parameter effects are model-dependent; same temperature produces different results across models

No universal optimal values; requires empirical testing to find good parameters for specific tasks

Parameter tuning is manual; no automated optimization or adaptive sampling

What makes it unique

Exposes sampling parameters directly through node-llama-cpp API, with examples (think, coding modules) showing how different parameters affect output for reasoning vs code generation tasks. The Advanced Topics documentation explains parameter tuning strategies.

vs alternatives

More transparent and controllable than cloud APIs that abstract sampling, enabling fine-grained tuning; requires more manual experimentation than APIs with built-in optimization.

token-counting-and-context-window-management

Medium confidence

Provides utilities and patterns for tracking token usage and managing context window constraints to prevent exceeding model limits. The architecture includes token counting logic (either through node-llama-cpp's built-in tokenizer or external libraries) that estimates prompt and response token counts before generation. Developers can use this information to implement context windowing strategies (e.g., dropping oldest messages when approaching limit) or warn users when approaching capacity.

Solves for

Track token usage to understand cost and performance implicationsImplement context window management for long conversations without exceeding limitsOptimize prompt length and conversation history to fit within model constraintsUnderstand how token counting affects agent design and memory management

Best for

developers building agents with limited context windows (4K-8K tokens)

teams managing long-running conversations or multi-turn reasoning

builders optimizing for cost or latency in token-constrained scenarios

Requires

Node.js 18+

Tokenizer implementation (node-llama-cpp's built-in or external library like js-tiktoken)

Token counting logic integrated into prompt building

Limitations

Token counting is approximate; actual token count may differ from estimates due to tokenizer variations

No built-in context windowing strategies; developers must implement their own (e.g., sliding window, summarization)

Counting adds overhead; frequent token counting can impact performance

What makes it unique

Addresses token management as an explicit concern in the learning path, with Advanced Topics documentation on token counting and cost optimization. Shows how to integrate token counting into agent loops to prevent context overflow.

vs alternatives

More transparent than cloud APIs that abstract token counting, enabling developers to understand and optimize token usage; requires manual implementation of windowing strategies, unlike some frameworks with built-in context management.

Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.

Related Artifactssharing capabilities

Artifacts that share capabilities with ai-agents-from-scratch, ranked by overlap. Discovered automatically through the match graph.

Repository27

LMQL

LMQL is a query language for large language...

local-model-executionmulti-backend-model-abstraction

2 shared capabilities

Repository24

gpt4all

A chatbot trained on a massive collection of clean assistant data including code, stories and dialogue.

local-llm-inference-with-llama-cpp-backend

1 shared capability

Product19

Private GPT

Tool for private interaction with your documents

configurable-local-llm-integration

1 shared capability

Prompt37

outlines

Structured Outputs

local model inference with transformers, llamacpp, and mlxlm backends

1 shared capability

Product20

Kilo Code

Open-source AI coding assistant for VS Code, JetBrains, and the CLI. [#opensource](https://github.com/Kilo-Org/kilocode)

local-first llm inference with pluggable model backends

1 shared capability

Product18

LMQL

LMQL is a query language for large language models.

local and remote model execution with unified interface

1 shared capability

Best For

✓developers building privacy-sensitive AI agents
✓educators teaching LLM fundamentals without cloud dependencies
✓teams prototyping agents with cost constraints
✓researchers experimenting with model quantization and inference optimization
✓developers learning agent architecture from first principles
✓teams building agents with local LLMs that lack native function calling
✓educators demonstrating tool use patterns without cloud API dependencies
✓builders prototyping custom tool ecosystems for specialized domains

Known Limitations

⚠Inference speed depends on local hardware; CPU-only inference is 10-50x slower than GPU-accelerated cloud APIs
⚠Memory footprint scales with model size; 7B parameter models require ~8GB RAM minimum
⚠No built-in batching or request queuing — single-threaded inference per model instance
⚠Platform-specific binary compilation adds ~5-10 minutes to npm install on first setup
⚠Limited to GGUF quantized models; cannot load full-precision or other formats without conversion
⚠Parsing function calls from text is fragile — LLM may generate malformed JSON or hallucinate function names

Requirements

Node.js 18+node-llama-cpp npm package (includes pre-compiled binaries)8GB+ RAM for 7B models, 16GB+ for 13B+ modelsGGUF quantized model file (e.g., Mistral, Llama 2, Phi) downloaded to ./models/ directorymacOS/Linux/Windows x64 with optional GPU drivers (CUDA 11.8+, Metal, or Vulkan)Tool definitions as JavaScript objects with {name, description, parameters, execute} structureSystem prompt that instructs LLM to output function calls in a specific format (JSON/XML)Parser function to extract and validate function calls from LLM output

Input / Output

Accepts: text prompts (string), system prompts (string), conversation history (array of message objects with role and content), tool schema definitions (JavaScript objects), user queries (text), LLM-generated text containing function calls (string), prompts (text), provider preference (local/cloud/auto), fallback strategy (fail, retry, switch provider), module code (JavaScript files), module documentation (CODE.md, CONCEPT.md), example prompts and queries (text), user messages (text), assistant responses (text), message metadata (role, timestamp, tool calls), user task or query (text), tool definitions (JavaScript objects), system prompt with ReAct instructions (text), prompt text (string), generation parameters (temperature, max_tokens, etc.), system prompt text (string), task-specific constraints or format requirements (text), array of independent prompts or requests (array of strings or objects), shared model instance (node-llama-cpp model), task description (text), hardware specifications (CPU, RAM, GPU), quality/speed preferences (qualitative), temperature value (float, typically 0.0-2.0), top-p value (float, typically 0.0-1.0), top-k value (integer, typically 0-100), other sampling parameters (repeat_penalty, etc.), prompts and messages (text), model context window size (integer), token budget or limit (integer)

Produces: text completion (string), streaming tokens (async iterator), structured JSON (when prompted with schema), parsed function calls (objects with name and arguments), tool execution results (any type, returned to LLM), agent responses (text after tool execution), LLM responses (text), provider metadata (which provider was used, latency, cost), fallback logs (if provider switching occurred), working agent implementations (JavaScript code), understanding of agent architecture (conceptual knowledge), executable examples demonstrating each concept (runnable code), conversation history (array of message objects), context-aware LLM responses (text), persisted conversation logs (JSON or database records), reasoning trace (array of thought steps and tool calls), final agent response (text), execution log with tool results (structured data), async iterator yielding individual tokens (string), accumulated full response (string after consuming stream), partial responses for progressive rendering (string), task-adapted LLM responses (text), structured output matching prompt-specified format (JSON, markdown, etc.), array of responses (array of strings), results with metadata (array of objects with response, status, timing), error reports for failed tasks (array of error objects), model recommendations (list of model names and specifications), quantization strategy guidance (Q4 vs Q5 vs Q8 tradeoffs), resource requirement estimates (RAM, disk, inference time), LLM responses with controlled randomness (text), parameter effectiveness metrics (quality, diversity, consistency), token count estimates (integer), context window utilization percentage (float), warnings or errors when approaching limits (string)

UnfragileRank

Adoption56%(30% weight)

Quality51%(25% weight)

Ecosystem60%(20% weight)

Match Graph10%(20% weight)

Freshness75%(5% weight)

UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.

Type: Agent

12 capabilities

Visit ai-agents-from-scratch→

Repository Details

3,417

Stars

515

Forks

JavaScript

Language

MIT

License

Topics

ai-agentseducationalfunction-callingllmllm-agentnode-llama-cppreact-agenttutorial

Last commit: Apr 19, 2026

About

Demystify AI agents by building them yourself. Local LLMs, no black boxes, real understanding of function calling, memory, and ReAct patterns.

Alternatives to ai-agents-from-scratch

vitest-llm-reporter30Repository

A Vitest reporter optimized for LLM parsing with structured, concise output

Compare →

vectra41Repository

A lightweight, file-backed vector database for Node.js and browsers with Pinecone-compatible filtering and hybrid BM25 search.

Compare →

@tanstack/ai37API

Core TanStack AI library - Open source AI SDK

Compare →

strapi-plugin-embeddings32Repository

AI embeddings and semantic search plugin for Strapi v5 with pgvector support

Compare →

Are you the builder of ai-agents-from-scratch?

Claim this artifact to get a verified badge, access match analytics, see which intents users search for, and manage your listing.

Claim this artifact →Verification via email

Get the weekly brief

New tools, rising stars, and what's actually worth your time. No spam.

Data Sources

github

Looking for something else?

Search →

Capabilities12 decomposed

local-llm-inference-via-node-llama-cpp

Medium confidence

Solves for

Best for

developers building privacy-sensitive AI agents

educators teaching LLM fundamentals without cloud dependencies

teams prototyping agents with cost constraints

Requires

Node.js 18+

node-llama-cpp npm package (includes pre-compiled binaries)

8GB+ RAM for 7B models, 16GB+ for 13B+ models

Limitations

Inference speed depends on local hardware; CPU-only inference is 10-50x slower than GPU-accelerated cloud APIs

Memory footprint scales with model size; 7B parameter models require ~8GB RAM minimum

No built-in batching or request queuing — single-threaded inference per model instance

What makes it unique

vs alternatives

function-calling-with-tool-schema-binding

Medium confidence

Solves for

Best for

developers learning agent architecture from first principles

teams building agents with local LLMs that lack native function calling

educators demonstrating tool use patterns without cloud API dependencies

Requires

Node.js 18+

Tool definitions as JavaScript objects with {name, description, parameters, execute} structure

System prompt that instructs LLM to output function calls in a specific format (JSON/XML)

Limitations

Parsing function calls from text is fragile — LLM may generate malformed JSON or hallucinate function names

No built-in retry logic if parsing fails; requires manual error handling and re-prompting

Schema validation is manual; no automatic type coercion or constraint enforcement

What makes it unique

vs alternatives

hybrid-local-cloud-model-switching

Medium confidence

Solves for

Best for

developers evaluating local vs cloud LLM tradeoffs

teams building hybrid systems with fallback capabilities

builders prototyping with cloud APIs then optimizing with local models

Requires

Node.js 18+

node-llama-cpp for local inference

OpenAI SDK or Anthropic SDK for cloud APIs

Limitations

Abstraction adds complexity; unified interface may not expose all model-specific features

Cost tracking becomes complex with multiple providers; requires separate accounting for local vs cloud usage

Quality differences between models may require different prompts or parameters; unified interface may not accommodate all variations

What makes it unique

vs alternatives

progressive-learning-path-with-modular-examples

Medium confidence

Solves for

Best for

developers new to AI agents wanting to understand fundamentals

educators teaching agent development and LLM concepts

teams building custom agents and needing architectural understanding

Requires

Node.js 18+

JavaScript/async programming knowledge

Willingness to read and modify code

Limitations

Learning path is linear; skipping modules may create knowledge gaps

Examples are educational, not production-ready; require significant refactoring for real use

No interactive exercises or automated testing; learners must manually verify understanding

What makes it unique

vs alternatives

persistent-conversation-memory-with-message-history

Medium confidence

Solves for

Best for

developers building conversational agents with local LLMs

educators teaching context management and memory patterns

teams prototyping chatbots that need session-level memory

Requires

Node.js 18+

Message history data structure (array of {role, content} objects)

LLM context window large enough for full history + new prompt (typically 2K-4K tokens for multi-turn)

Limitations

No automatic token counting — developers must manually track context window usage or implement token budgeting

Full history approach hits context window limits quickly; requires manual windowing or summarization for long conversations

No built-in persistence — memory is lost when process terminates unless explicitly saved to disk/database

What makes it unique

vs alternatives

react-pattern-agent-orchestration

Medium confidence

Solves for

Best for

developers building autonomous agents with local LLMs

educators teaching agent orchestration patterns and reasoning loops

teams prototyping multi-step task automation (research, planning, execution)

Requires

Node.js 18+

Local LLM with sufficient context window (4K+ tokens recommended)

Tool definitions with schema and execution functions

Limitations

No built-in loop termination logic — requires manual max-iterations limit or explicit stop conditions to prevent infinite loops

Tool execution errors are not automatically handled; malformed tool calls or failed executions require explicit error recovery logic

Reasoning traces consume tokens rapidly; long reasoning chains can exhaust context window before reaching a solution

What makes it unique

vs alternatives

streaming-token-generation-with-async-iteration

Medium confidence

Solves for

Best for

developers building interactive chat applications or web interfaces

teams creating responsive user experiences with LLM outputs

builders implementing real-time collaboration features with AI

Requires

Node.js 18+ with async/await and async iterator support

node-llama-cpp with streaming API enabled

Async function to consume the token stream (e.g., for loop with await)

Limitations

Streaming adds complexity to error handling — errors may occur mid-stream, requiring graceful degradation

Token-by-token processing prevents some optimizations (e.g., batch validation, structured output parsing)

Streaming output is harder to validate or constrain to a schema; requires post-processing or specialized parsing

What makes it unique

vs alternatives

system-prompt-specialization-for-task-adaptation

Medium confidence

Solves for

Best for

developers building multi-purpose agent systems with task switching

teams experimenting with prompt engineering for behavior control

educators teaching how LLMs respond to instructions and constraints

Requires

Node.js 18+

Well-crafted system prompt text (typically 100-500 tokens)

Understanding of target LLM's capabilities and limitations

Limitations

Prompt engineering is empirical and non-deterministic — same prompt may produce different results across runs or models

System prompts consume tokens; complex prompts reduce available context for user input and conversation history

No guarantee that LLM will follow system prompt constraints; models may ignore instructions or hallucinate

What makes it unique

vs alternatives

batch-parallel-processing-with-concurrent-inference

Medium confidence

Solves for

Best for

developers building batch processing pipelines with local LLMs

teams processing large document collections or datasets with AI

builders implementing multi-user agent systems with resource sharing

Requires

Node.js 18+ with Promise and async/await support

Sufficient RAM to load model once and run multiple inference tasks (typically 2-4x base model memory)

Task queue or concurrency limiter to prevent resource exhaustion (e.g., p-limit library)

Limitations

Concurrency is limited by available RAM and CPU; too many parallel tasks cause memory exhaustion or context switching overhead

node-llama-cpp may not support true parallelism on all platforms; some implementations serialize inference despite async code

No built-in load balancing or queue management; developers must manually limit concurrent tasks to avoid resource exhaustion

What makes it unique

vs alternatives

model-selection-and-quantization-strategy-guidance

Medium confidence

Solves for

Best for

developers new to local LLM deployment making first model selection

teams evaluating models for production agent systems

educators teaching model selection criteria and quantization concepts

Requires

Node.js 18+

npx node-llama-cpp pull command to browse available models

Understanding of model parameters (7B, 13B, 70B) and quantization levels (Q4, Q5, Q8)

Limitations

Guidance is general; optimal model depends on specific task and cannot be determined without testing

Quantization quality varies by model and quantization method; no universal quality metric

Inference speed benchmarks are hardware-dependent; speeds vary significantly across CPU/GPU configurations

What makes it unique

vs alternatives

temperature-and-sampling-parameter-control

Medium confidence

Solves for

Best for

developers fine-tuning agent behavior for specific tasks

teams experimenting with output quality and diversity tradeoffs

educators teaching LLM sampling and probability concepts

Requires

Node.js 18+

node-llama-cpp with sampling parameter support

Understanding of temperature, top-p, top-k, and other sampling parameters

Limitations

Parameter effects are model-dependent; same temperature produces different results across models

No universal optimal values; requires empirical testing to find good parameters for specific tasks

Parameter tuning is manual; no automated optimization or adaptive sampling

What makes it unique

vs alternatives

More transparent and controllable than cloud APIs that abstract sampling, enabling fine-grained tuning; requires more manual experimentation than APIs with built-in optimization.

token-counting-and-context-window-management

Medium confidence

Solves for

Best for

developers building agents with limited context windows (4K-8K tokens)

teams managing long-running conversations or multi-turn reasoning

builders optimizing for cost or latency in token-constrained scenarios

Requires

Node.js 18+

Tokenizer implementation (node-llama-cpp's built-in or external library like js-tiktoken)

Token counting logic integrated into prompt building

Limitations

Token counting is approximate; actual token count may differ from estimates due to tokenizer variations

No built-in context windowing strategies; developers must implement their own (e.g., sliding window, summarization)

Counting adds overhead; frequent token counting can impact performance

What makes it unique

vs alternatives

Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.

Alternatives to ai-agents-from-scratch

vitest-llm-reporter30Repository

A Vitest reporter optimized for LLM parsing with structured, concise output

Compare →

vectra41Repository

A lightweight, file-backed vector database for Node.js and browsers with Pinecone-compatible filtering and hybrid BM25 search.

Compare →

@tanstack/ai37API

Core TanStack AI library - Open source AI SDK

Compare →

strapi-plugin-embeddings32Repository

AI embeddings and semantic search plugin for Strapi v5 with pgvector support

Compare →

ai-agents-from-scratch

Capabilities12 decomposed

local-llm-inference-via-node-llama-cpp

function-calling-with-tool-schema-binding

hybrid-local-cloud-model-switching

progressive-learning-path-with-modular-examples

persistent-conversation-memory-with-message-history

react-pattern-agent-orchestration

streaming-token-generation-with-async-iteration

system-prompt-specialization-for-task-adaptation

batch-parallel-processing-with-concurrent-inference

model-selection-and-quantization-strategy-guidance

temperature-and-sampling-parameter-control

token-counting-and-context-window-management

Related Artifactssharing capabilities

LMQL

gpt4all

Private GPT

outlines

Kilo Code

LMQL

Best For

Known Limitations

Requirements

Input / Output

UnfragileRank

Repository Details

About

Categories

Alternatives to ai-agents-from-scratch

Are you the builder of ai-agents-from-scratch?

Get the weekly brief

Data Sources

ai-agents-from-scratch

Capabilities12 decomposed

local-llm-inference-via-node-llama-cpp

function-calling-with-tool-schema-binding

hybrid-local-cloud-model-switching

progressive-learning-path-with-modular-examples

persistent-conversation-memory-with-message-history

react-pattern-agent-orchestration

streaming-token-generation-with-async-iteration

system-prompt-specialization-for-task-adaptation

batch-parallel-processing-with-concurrent-inference

model-selection-and-quantization-strategy-guidance

temperature-and-sampling-parameter-control

token-counting-and-context-window-management

Related Artifactssharing capabilities

LMQL

gpt4all

Private GPT

outlines

Kilo Code

LMQL

Best For

Known Limitations

Requirements

Input / Output

UnfragileRank

Repository Details

About

Categories

Alternatives to ai-agents-from-scratch

Are you the builder of ai-agents-from-scratch?

Get the weekly brief

Data Sources