{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"hn-46294144","slug":"stop-ai-scrapers-from-hammering-your-self-hosted-b","name":"Stop AI scrapers from hammering your self-hosted blog","type":"repo","url":"https://github.com/vivienhenz24/fuzzy-canary","page_url":"https://unfragile.ai/stop-ai-scrapers-from-hammering-your-self-hosted-b","categories":["automation"],"tags":["hackernews","show-hn"],"pricing":{"model":"open_source","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"hn-46294144__cap_0","uri":"capability://safety.moderation.http.request.fingerprinting.and.bot.detection.via.behavioral.analysis","name":"http request fingerprinting and bot detection via behavioral analysis","description":"Analyzes incoming HTTP requests to identify AI scraper patterns by examining user-agent strings, request headers, timing patterns, and access sequences. Uses heuristic matching against known scraper signatures (GPTBot, CCBot, etc.) combined with behavioral analysis of request frequency and resource access patterns to distinguish legitimate traffic from automated crawlers without requiring IP blocklists or rate limiting.","intents":["Identify which requests are coming from AI scraper bots vs legitimate users","Detect scraper behavior patterns before they consume significant bandwidth","Distinguish between different types of bots (search engines vs AI training crawlers)"],"best_for":["Self-hosted blog operators with limited infrastructure budgets","Content creators concerned about unauthorized AI training data harvesting","Developers building bot detection into existing web servers"],"limitations":["Heuristic-based detection can have false positives if legitimate clients spoof user-agents","Requires ongoing signature updates as scrapers evolve their request patterns","Cannot detect sophisticated scrapers that mimic legitimate browser behavior perfectly"],"requires":["Web server with request header inspection capability","Access to HTTP request/response middleware layer","Ability to log and analyze request patterns"],"input_types":["HTTP request headers","User-agent strings","Request timing metadata","Access pattern sequences"],"output_types":["Bot classification (scraper/legitimate)","Confidence score","Bot type identifier"],"categories":["safety-moderation","bot-detection"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hn-46294144__cap_1","uri":"capability://safety.moderation.conditional.response.injection.based.on.bot.classification","name":"conditional response injection based on bot classification","description":"Intercepts HTTP responses destined for detected scraper bots and injects alternative content (specifically adult/NSFW material) that serves as a honeypot signal without blocking legitimate traffic. The injection happens at the middleware layer before response transmission, allowing the server to serve normal content to legitimate users while feeding scrapers with content that degrades training data quality or triggers scraper filtering mechanisms.","intents":["Serve different content to scrapers vs legitimate users without blocking either","Poison scraper training datasets with irrelevant or harmful content","Create a canary signal that indicates scraper activity without disrupting user experience"],"best_for":["Blog operators wanting to protect content without blocking access entirely","Developers implementing data poisoning strategies against AI training","Teams seeking non-confrontational scraper deterrence"],"limitations":["Requires careful content selection to avoid legal/liability issues with injected material","Sophisticated scrapers may filter injected content post-download","May violate terms of service if scrapers detect and report the injection","Effectiveness depends on scraper's ability to distinguish injected vs original content"],"requires":["Web server middleware with response interception capability","Storage or generation mechanism for alternative content","Bot classification system upstream of response injection"],"input_types":["HTTP response body","Bot classification flag","Original content"],"output_types":["Modified HTTP response","Alternative content payload"],"categories":["safety-moderation","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hn-46294144__cap_2","uri":"capability://automation.workflow.self.hosted.deployment.without.external.dependencies","name":"self-hosted deployment without external dependencies","description":"Provides a complete bot detection and response injection system deployable on self-hosted infrastructure without reliance on third-party SaaS platforms, cloud APIs, or external bot detection services. All detection logic, signature matching, and response handling runs locally on the server, eliminating latency from external API calls and avoiding data transmission to third parties.","intents":["Deploy bot protection without sending traffic metadata to external services","Maintain full control over detection logic and response behavior","Avoid subscription costs and vendor lock-in for bot detection"],"best_for":["Privacy-conscious blog operators","Teams with strict data residency requirements","Developers wanting to avoid SaaS dependencies"],"limitations":["Requires manual maintenance of scraper signature databases","No access to global threat intelligence from other deployments","Operator responsible for keeping detection rules current as scrapers evolve","Limited ability to detect novel/zero-day scraper patterns without external data"],"requires":["Self-hosted web server (Apache, Nginx, etc.)","Ability to install middleware or plugins","Maintenance capacity for signature updates"],"input_types":["Local HTTP traffic"],"output_types":["Local bot classification decisions","Modified responses"],"categories":["automation-workflow","safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hn-46294144__cap_3","uri":"capability://safety.moderation.scraper.signature.matching.against.known.ai.bot.user.agents","name":"scraper signature matching against known ai bot user-agents","description":"Maintains and matches incoming request user-agent strings against a database of known AI scraper identifiers (GPTBot, CCBot, Anthropic-AI, etc.). Uses string pattern matching to identify requests from common AI training crawlers, search engine bots, and known scraper tools. The signature database can be updated to include new scraper patterns as they emerge.","intents":["Quickly identify requests from known AI training bots","Build a whitelist/blacklist of specific bot types","Track which AI companies are scraping your content"],"best_for":["Blog operators wanting visibility into which AI companies are accessing their content","Developers building bot classification systems","Teams tracking scraper activity for compliance or legal purposes"],"limitations":["Only detects bots that identify themselves via user-agent headers","Sophisticated scrapers can spoof legitimate user-agent strings","Requires manual updates to signature database as new bots emerge","Cannot detect headless browser scrapers that use legitimate browser user-agents"],"requires":["Access to HTTP request headers","Signature database of known scraper user-agents","String matching/regex engine"],"input_types":["HTTP User-Agent header","Request headers"],"output_types":["Bot type identifier","Match confidence"],"categories":["safety-moderation","search-retrieval"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hn-46294144__cap_4","uri":"capability://automation.workflow.lightweight.middleware.integration.for.existing.web.servers","name":"lightweight middleware integration for existing web servers","description":"Integrates as a thin middleware layer into existing web server stacks (Nginx, Apache, etc.) without requiring major architectural changes or application rewrites. The middleware intercepts requests early in the request pipeline, performs bot classification, and conditionally modifies responses before they reach the client, minimizing performance overhead and integration complexity.","intents":["Add bot detection to an existing blog without rewriting the application","Integrate scraper protection into a running web server with minimal downtime","Maintain separation between bot detection logic and application code"],"best_for":["Blog operators with existing web server deployments","Teams wanting to add bot protection without application changes","Developers preferring infrastructure-level solutions over application-level changes"],"limitations":["Middleware overhead adds latency to every request (typically 5-50ms depending on implementation)","Integration method depends on specific web server (Nginx modules vs Apache modules vs reverse proxy)","May require web server restart to deploy updates","Limited visibility into application-level context (can only see HTTP layer)"],"requires":["Web server with middleware/module support (Nginx, Apache, HAProxy, etc.)","Ability to modify web server configuration","Appropriate middleware/module for target web server"],"input_types":["HTTP request"],"output_types":["Modified HTTP response"],"categories":["automation-workflow","tool-use-integration"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":46,"verified":false,"data_access_risk":"high","permissions":["Web server with request header inspection capability","Access to HTTP request/response middleware layer","Ability to log and analyze request patterns","Web server middleware with response interception capability","Storage or generation mechanism for alternative content","Bot classification system upstream of response injection","Self-hosted web server (Apache, Nginx, etc.)","Ability to install middleware or plugins","Maintenance capacity for signature updates","Access to HTTP request headers"],"failure_modes":["Heuristic-based detection can have false positives if legitimate clients spoof user-agents","Requires ongoing signature updates as scrapers evolve their request patterns","Cannot detect sophisticated scrapers that mimic legitimate browser behavior perfectly","Requires careful content selection to avoid legal/liability issues with injected material","Sophisticated scrapers may filter injected content post-download","May violate terms of service if scrapers detect and report the injection","Effectiveness depends on scraper's ability to distinguish injected vs original content","Requires manual maintenance of scraper signature databases","No access to global threat intelligence from other deployments","Operator responsible for keeping detection rules current as scrapers evolve","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.82,"quality":0.2,"ecosystem":0.36,"match_graph":0.25,"freshness":0.9,"weights":{"adoption":0.3,"quality":0.2,"ecosystem":0.15,"match_graph":0.3,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-05-24T12:16:23.326Z","last_scraped_at":"2026-05-04T08:09:54.663Z","last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=stop-ai-scrapers-from-hammering-your-self-hosted-b","compare_url":"https://unfragile.ai/compare?artifact=stop-ai-scrapers-from-hammering-your-self-hosted-b"}},"signature":"4oCaNN26mpMuHHczVYv1KaCWyTylTEVoaS9GRuhi93RHKKgnP3LmCc3o3DaiusRcOPfY4B0y6uQ/aSYpFBzpDg==","signedAt":"2026-06-15T19:27:27.068Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/stop-ai-scrapers-from-hammering-your-self-hosted-b","artifact":"https://unfragile.ai/stop-ai-scrapers-from-hammering-your-self-hosted-b","verify":"https://unfragile.ai/api/v1/verify?slug=stop-ai-scrapers-from-hammering-your-self-hosted-b","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}