raw html fetching with javascript rendering
Fetches live web content as raw HTML with optional JavaScript execution via the Crawlbase API backend. The MCP server wraps Crawlbase's rendering infrastructure, supporting both static HTML requests (using CRAWLBASE_TOKEN) and JavaScript-rendered pages (using CRAWLBASE_JS_TOKEN). Requests are routed through a retry queue with exponential backoff for resilience against transient failures.
Unique: Integrates Crawlbase's production-grade proxy rotation and anti-bot evasion infrastructure directly into the MCP protocol, eliminating the need for agents to manage their own proxy pools or handle bot detection. Uses dual-token authentication (standard vs JS) to optimize cost by routing each request to the appropriate backend infrastructure based on its rendering requirements.
vs alternatives: Provides JavaScript rendering and proxy rotation out of the box (unlike Puppeteer or Playwright, which require local infrastructure), while being simpler to deploy than self-hosted scraping stacks and offering geographic targeting that pure headless browser solutions don't provide.
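For illustration, a minimal sketch of invoking the crawl tool from an MCP client over stdio using the @modelcontextprotocol/sdk client API; the @crawlbase/mcp package name, launch command, and argument shape are assumptions based on the description above.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function fetchRawHtml(url: string): Promise<string> {
  // Launch the MCP server as a subprocess; package name and env wiring
  // are assumptions for illustration.
  const transport = new StdioClientTransport({
    command: "npx",
    args: ["@crawlbase/mcp"], // assumed package name
    env: {
      CRAWLBASE_TOKEN: process.env.CRAWLBASE_TOKEN ?? "",
      CRAWLBASE_JS_TOKEN: process.env.CRAWLBASE_JS_TOKEN ?? "",
    },
  });
  const client = new Client({ name: "example-client", version: "1.0.0" });
  await client.connect(transport);

  // The crawl tool returns the fetched page as raw HTML in text content blocks.
  const result = await client.callTool({ name: "crawl", arguments: { url } });
  await client.close();
  return (result.content as Array<{ type: string; text?: string }>)
    .map((block) => block.text ?? "")
    .join("");
}
```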
markdown content extraction from web pages
Extracts and converts web page content to clean, structured markdown format via the crawl_markdown tool. The MCP server delegates to Crawlbase's content processing pipeline, which parses HTML, removes boilerplate (navigation, ads, footers), and outputs markdown-formatted text suitable for LLM consumption. Supports the same rendering options as raw HTML fetching (JavaScript execution, proxy rotation, geographic targeting).
Unique: Provides server-side markdown extraction as part of the Crawlbase API rather than requiring client-side HTML parsing libraries. Combines JavaScript rendering, proxy rotation, and content extraction in a single API call, reducing latency and complexity compared to fetch-then-parse workflows.
vs alternatives: Eliminates the need for separate HTML parsing libraries (Cheerio, jsdom) and handles JavaScript-rendered content natively, whereas client-side extraction tools require either headless browsers or static HTML parsing that fails on dynamic content.
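A sketch of a crawl_markdown call, assuming a connected MCP client as in the earlier raw-HTML example; the single-URL argument shape is an assumption based on the tool description.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";

// Reuses a connected client as in the raw-HTML sketch above.
async function fetchMarkdown(client: Client, url: string): Promise<string> {
  const result = await client.callTool({
    name: "crawl_markdown",
    arguments: { url },
  });
  // The server returns the extracted markdown as a text content block.
  const blocks = result.content as Array<{ type: string; text?: string }>;
  return blocks.find((b) => b.type === "text")?.text ?? "";
}
```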
multi-sdk support across node.js, python, java, php, and .net
Provides official SDKs for multiple programming languages (Node.js, Python, Java, PHP, .NET) that wrap the Crawlbase API, enabling developers to use web scraping capabilities from their preferred language. Each SDK implements the same core functionality (HTML fetching, markdown extraction, screenshot capture) with language-idiomatic APIs. SDKs handle authentication, request formatting, and response parsing, abstracting away HTTP details.
Unique: Covers five major programming languages with a consistent API surface, enabling native integration without HTTP client boilerplate, while respecting each language's conventions (e.g., async/await in Python, Promises in Node.js, Futures in Java).
vs alternatives: More convenient than raw HTTP clients for each language; however, less flexible than direct API access for non-standard use cases or advanced features not exposed in the SDKs.
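For example, fetching a page through the Node.js SDK looks roughly like this; the CrawlingAPI class and get method follow the crawlbase npm package's published examples, though details may differ by version.

```typescript
import { CrawlingAPI } from "crawlbase";

// Token comes from the Crawlbase dashboard; swap in the JS token for
// pages that need JavaScript rendering.
const api = new CrawlingAPI({ token: process.env.CRAWLBASE_TOKEN ?? "" });

const response = await api.get("https://example.com");
if (response.statusCode === 200) {
  console.log(response.body); // raw HTML of the fetched page
}
```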
webpage screenshot capture with rendering
Captures full-page or viewport screenshots of web content as base64-encoded images via the crawl_screenshot tool. The MCP server delegates to Crawlbase's screenshot infrastructure, which renders pages with JavaScript execution, applies geographic/device targeting, and returns PNG images encoded as base64 strings. Supports the same proxy rotation and anti-bot evasion as HTML fetching.
Unique: Provides server-side screenshot rendering with proxy rotation and geographic targeting, eliminating the need for agents to manage headless browser instances. Returns base64-encoded images directly compatible with vision-capable LLMs, enabling multi-modal analysis without intermediate image storage.
vs alternatives: Simpler than deploying Puppeteer/Playwright infrastructure and includes anti-bot evasion that headless browsers lack; however, less flexible than client-side rendering for custom viewport sizes or interaction sequences.
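A sketch of saving a crawl_screenshot result, assuming a connected client and that the tool returns a standard MCP image content block (base64 data plus MIME type).

```typescript
import { writeFileSync } from "node:fs";
import { Client } from "@modelcontextprotocol/sdk/client/index.js";

// Decodes the base64 PNG returned by crawl_screenshot and writes it to disk.
async function saveScreenshot(client: Client, url: string, path: string) {
  const result = await client.callTool({
    name: "crawl_screenshot",
    arguments: { url },
  });
  const image = (result.content as Array<{ type: string; data?: string }>)
    .find((block) => block.type === "image");
  if (!image?.data) throw new Error("no image content returned");
  writeFileSync(path, Buffer.from(image.data, "base64"));
}
```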
dual-mode mcp server deployment (stdio and http)
Provides two distinct operational modes for integrating web scraping into AI applications. In stdio mode, the server runs as a subprocess of desktop AI clients (Claude, Cursor, Windsurf), communicating over standard input/output streams; in HTTP mode, it runs as a standalone network service supporting multi-user access and custom integrations. Both modes expose the same three tools (crawl, crawl_markdown, crawl_screenshot) through the standardized MCP protocol, with authentication handled via environment variables (stdio) or HTTP headers (HTTP).
Unique: Implements both stdio and HTTP transport layers within a single codebase, allowing the same MCP server to operate as a subprocess for desktop clients or as a standalone network service. Uses StdioServerTransport from @modelcontextprotocol/sdk for stdio mode and Express.js for HTTP mode, providing flexibility for different deployment architectures without code duplication.
vs alternatives: More flexible than single-mode MCP servers; supports both local desktop integration and cloud deployments from the same codebase. Simpler than building separate stdio and HTTP implementations while maintaining the standardized MCP protocol interface.
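A condensed sketch of how such dual-mode bootstrapping can look with the TypeScript SDK; the MCP_MODE switch, port, and use of the SDK's Streamable HTTP transport behind Express are assumptions, not the server's confirmed wiring.

```typescript
import express from "express";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { StreamableHTTPServerTransport } from "@modelcontextprotocol/sdk/server/streamableHttp.js";

// Builds one server instance; tool registrations omitted for brevity.
function buildServer(): McpServer {
  const server = new McpServer({ name: "crawlbase-mcp", version: "1.0.0" });
  // ...register crawl / crawl_markdown / crawl_screenshot here...
  return server;
}

if (process.env.MCP_MODE === "http") {
  // HTTP mode: stateless handling, one transport per POSTed JSON-RPC message.
  const app = express();
  app.use(express.json());
  app.post("/mcp", async (req, res) => {
    const transport = new StreamableHTTPServerTransport({
      sessionIdGenerator: undefined, // stateless: no session ids
    });
    await buildServer().connect(transport);
    await transport.handleRequest(req, res, req.body);
  });
  app.listen(3000);
} else {
  // stdio mode: speak MCP over stdin/stdout as a desktop-client subprocess.
  await buildServer().connect(new StdioServerTransport());
}
```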
retry queue with exponential backoff for resilience
Implements automatic retry logic with exponential backoff for failed Crawlbase API requests, improving reliability for transient failures (network timeouts, temporary API unavailability, rate limiting). The retry queue is integrated into the request processing pipeline, transparently retrying failed requests without exposing retry logic to the MCP client. Backoff strategy prevents overwhelming the Crawlbase API during outages.
Unique: Integrates retry logic at the MCP server level rather than requiring each client to implement its own retry strategy. Exponential backoff prevents thundering herd problems during API outages, and transparent retry handling keeps the MCP protocol interface simple.
vs alternatives: Simpler than client-side retry logic and prevents duplicate retry attempts across multiple clients; however, lacks configurability compared to libraries like axios-retry or p-retry that expose backoff parameters.
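A generic sketch of the retry pattern described above; the attempt count, base delay, and jitter factor are illustrative defaults, not the server's actual tuning.

```typescript
// Retries an async operation with exponentially growing, jittered delays.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err;
      // Exponential backoff with jitter: 500ms, 1s, 2s, plus up to 25%,
      // so retries spread out instead of hammering the API in lockstep.
      const delay = baseDelayMs * 2 ** attempt * (1 + Math.random() * 0.25);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```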
geographic targeting and device emulation
Enables requests to be routed through Crawlbase's proxy infrastructure with geographic targeting and device emulation, allowing agents to fetch content as if browsing from different regions or device types. Implemented via request parameters passed to the Crawlbase API, supporting country/region selection and device type emulation (mobile, desktop, tablet). Useful for testing geo-blocked content, mobile-specific rendering, or region-specific pricing.
Unique: Leverages Crawlbase's distributed proxy infrastructure to expose geographic targeting and device emulation as first-class MCP tool parameters, eliminating the need for agents to manage their own proxy pools or device emulation logic.
vs alternatives: Simpler than managing separate proxy providers or device emulation libraries; however, less flexible than Puppeteer/Playwright for custom device configurations or interaction sequences.
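A hypothetical call showing how targeting options might be passed; the country and device parameter names mirror Crawlbase API conventions, and the nesting under options is an assumption about this tool's schema.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";

// Fetches a page as if browsing from Germany on a mobile device.
async function fetchAsGermanMobile(client: Client, url: string) {
  return client.callTool({
    name: "crawl",
    arguments: { url, options: { country: "DE", device: "mobile" } },
  });
}
```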
mcp protocol tool registration and schema validation
Registers the three web scraping tools (crawl, crawl_markdown, crawl_screenshot) as MCP tools with standardized JSON schemas, enabling AI clients to discover and invoke them through the MCP protocol. Each tool has a defined schema specifying input parameters (URL, optional request options) and output types (HTML, markdown, or base64 image). Schema validation ensures requests conform to expected types before being forwarded to the Crawlbase API.
Unique: Implements MCP tool registration using the @modelcontextprotocol/sdk, providing standardized tool discovery and invocation for AI clients. Schemas are defined declaratively and validated automatically, reducing boilerplate compared to custom RPC implementations.
vs alternatives: Standardized MCP protocol enables interoperability with multiple AI clients without custom integration code; however, less flexible than custom RPC implementations for non-standard tool patterns.
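A minimal sketch of such declarative registration with the TypeScript SDK, where a zod shape is validated before the handler runs; the description string and fetch helper are illustrative stand-ins.

```typescript
import { z } from "zod";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";

const server = new McpServer({ name: "crawlbase-mcp", version: "1.0.0" });

// Hypothetical helper standing in for the Crawlbase API call.
async function fetchFromCrawlbase(url: string): Promise<string> {
  const res = await fetch(url); // the real server calls Crawlbase here
  return res.text();
}

// The zod shape is validated by the SDK before the handler runs, so the
// handler only ever sees a well-formed URL.
server.tool(
  "crawl",
  "Fetch a URL and return its raw HTML",
  { url: z.string().url() },
  async ({ url }) => ({
    content: [{ type: "text" as const, text: await fetchFromCrawlbase(url) }],
  }),
);
```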