pi extension · v… · MIT

Let your LLM crawl & see the web.

pi-websearch-crawl4ai is a pi extension that gives your coding agent a set of web_fetch / web_crawl / web_screenshot tools, backed by a self-hosted Crawl4AI server. Clean markdown, sanitized HTML, and full-page screenshots — even after you disable pi's bash tool and take curl away.

✓ 6 web tools registered · ✓ 0 extra npm deps · ✓ self-hosted — your Crawl4AI server · ✓ works with --tools read,write,edit,web_fetch,…
~ pi — bash disabled · web_fetch enabled
// User: summarize https://docs.crawl4ai.com/core/markdown-generation/
 web_fetch(url="https://docs.crawl4ai.com/core/markdown-generation/",
                   filter="fit")
  # Markdown Generation
  Crawl4AI transforms HTML into clean, LLM-friendly Markdown …
  ✓ 200 OK · 4,812 chars · filter=fit

 web_crawl(urls=["https://a.example", "https://b.example"],
                   crawler_config={"type":"CrawlerRunConfig",
                                   "params":{"cache_mode":"bypass"}})
  ## https://a.example    success=true status=200
  ## https://b.example    success=true status=200

 bash(command="curl https://example.com")
  ✗ tool bash is disabled in this session — use web_fetch instead

 web_screenshot(url="https://example.com")
  ✓ png 118 KB · returned inline to the model

Why this exists

The safest pi sessions run with the bash tool turned off — no shell, no curl, no wget, nothing to rm -rf. But the moment you do that, your agent loses the ability to read the web. No docs lookups. No changelog peeks. No “grab that URL the user pasted.”

pi-websearch-crawl4ai gives the web back, but through a narrow door: a handful of purpose-built tools that talk HTTP to a Crawl4AI server you run. No outbound shell. No arbitrary subprocesses. No third-party SaaS to leak your prompts to. Just fetch() to http://localhost:11235.

6 · web tools exposed to the agent
0 · runtime npm dependencies — uses global fetch
localhost · Crawl4AI runs on your box; nothing leaves your network
1 file · single-file extension, hot-loadable by pi

The six tools

Each tool is a thin, typed wrapper over a Crawl4AI REST endpoint, registered with pi so the model can call it directly. Every tool ships with promptSnippet and promptGuidelines so the agent knows to prefer web_fetch over bash + curl.
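The wrapper pattern can be sketched roughly as below. This is illustrative, not the extension's actual source: the helper names are invented, and the `/md` body fields (`f` for filter) follow the session demo above but should be treated as assumptions rather than Crawl4AI's verbatim schema.

```typescript
// Sketch of the thin-wrapper pattern. `buildRequest` and `webFetch` are
// illustrative names; the `/md` body fields are assumptions, not a verbatim schema.
type Crawl4aiConfig = { baseUrl: string; token?: string };

function buildRequest(
  cfg: Crawl4aiConfig,
  path: string,
  body: Record<string, unknown>,
): { url: string; init: { method: string; headers: Record<string, string>; body: string } } {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  // Bearer token only when the server has JWT auth enabled
  if (cfg.token) headers["Authorization"] = `Bearer ${cfg.token}`;
  return {
    url: new URL(path, cfg.baseUrl).toString(),
    init: { method: "POST", headers, body: JSON.stringify(body) },
  };
}

// Each tool then reduces to "pick an endpoint, forward typed params":
async function webFetch(cfg: Crawl4aiConfig, url: string, filter = "fit") {
  const { url: endpoint, init } = buildRequest(cfg, "/md", { url, f: filter });
  const res = await fetch(endpoint, init); // global fetch — zero npm deps
  if (!res.ok) throw new Error(`crawl4ai returned ${res.status}`);
  return res.json();
}
```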

Tool · Endpoint · Purpose
web_fetch POST /md URL → clean Markdown. Filters: fit (readability), raw, bm25, llm. Supports query, max_chars.
web_fetch_html POST /html Sanitized, preprocessed HTML when you need DOM structure instead of prose.
web_crawl POST /crawl Batch-crawl many URLs with typed BrowserConfig / CrawlerRunConfig passed straight through.
web_execute_js POST /execute_js Run JS snippets on a page and read the return values. For stuff behind dynamic interactions.
web_screenshot POST /screenshot Full-page PNG screenshot, returned inline as an image so vision-capable models can look at it.
web_ask GET /ask Query Crawl4AI's own library docs — useful only when the model needs to configure web_crawl itself.
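For instance, the web_crawl row forwards its typed config straight through as a {type, params} envelope, mirroring the session demo at the top of the page. A sketch — the helper name is invented, while the envelope shape and the 100-URL cap come from this page:

```typescript
// Hypothetical helper showing the payload web_crawl sends to POST /crawl.
// Typed config objects are forwarded verbatim as {type, params} envelopes.
function buildCrawlBody(urls: string[], cacheMode = "bypass") {
  if (urls.length > 100) throw new Error("web_crawl accepts at most 100 URLs");
  return {
    urls,
    crawler_config: {
      type: "CrawlerRunConfig",
      params: { cache_mode: cacheMode },
    },
  };
}
```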

Plus three slash commands: /crawl4ai-status (pings /health), /crawl4ai-url <url>, and /crawl4ai-token <tok> — for switching servers or toggling JWT without a restart.

What you get

The small set of things that matter for an LLM's web I/O.

LLM-grade Markdown

Crawl4AI's readability-style fit filter strips chrome, ads, and nav; what reaches the model is what a human would read — tokens well spent.

Query-aware fetching

filter="bm25" + a query lets the agent pull only the most relevant chunks of a long page, keeping context windows honest.

Batch crawling

web_crawl takes up to 100 URLs at once with a shared typed config, returning per-URL markdown, links, metadata, and status.

JS execution

web_execute_js runs arbitrary snippets against the live DOM (return document.title, hrefs, JSON state) for pages plain fetching can't crack.

Inline screenshots

web_screenshot returns PNG bytes as an ImageContent block — vision-capable models see the page exactly as a browser would render it.

Self-hosted by default

Crawl4AI runs in Docker on localhost:11235. Your URLs, prompts, and cookies never leave your network, unlike managed scraping APIs.

Bounded output

Every tool honors a max_chars / max_chars_per_result cap. No agent can nuke its context by fetching a 4 MB blog post.

Hot config

Change the Crawl4AI endpoint or JWT on the fly with /crawl4ai-url / /crawl4ai-token — no reload, no restart.

Pure pi extension

Single-file TypeScript, zero runtime deps. Installs via pi install. No daemons, no extra services — just tools the agent can call.
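The bounded-output guard can be pictured as a one-line truncation applied before anything reaches the model — a sketch with an invented helper name, not the extension's actual code:

```typescript
// Hypothetical guard illustrating the max_chars cap: trim oversized tool
// output and flag the truncation, so a huge page can't flood the context.
function capOutput(text: string, maxChars: number): string {
  if (text.length <= maxChars) return text;
  return text.slice(0, maxChars) + `\n… [truncated at ${maxChars} chars]`;
}
```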

Install

You need two things: a running Crawl4AI server, and the extension loaded into pi. Neither requires npm on your PATH; Docker runs the server, pi runs the extension.

1 · start crawl4ai
# one-liner Docker install; exposes http://localhost:11235
docker run -d \
  -p 11235:11235 \
  --name crawl4ai \
  --shm-size=1g \
  unclecode/crawl4ai:latest

# sanity check
curl http://localhost:11235/health
2 · install the extension
# from npm (recommended)
pi install npm:@codingcoffee/pi-websearch-crawl4ai

# or from git
pi install git:github.com/codingcoffee/pi-websearch-crawl4ai

# try without installing
pi -e npm:@codingcoffee/pi-websearch-crawl4ai

Then lock pi down to just the tools you trust, and add web_fetch / web_crawl to the allowlist — your agent can now read the web without ever seeing a shell:

example: bash-free pi session
pi --tools read,write,edit,web_fetch,web_crawl,web_screenshot

# inside pi
 /crawl4ai-status    # confirm the server is up
 "Read https://example.com and summarize it."

Configuration

Three sources, in priority order: CLI flag > environment variable > default. All three are per-session; nothing is written to disk.

Setting · Env var · CLI flag · Default
Base URL CRAWL4AI_BASE_URL --crawl4ai-url <url> http://localhost:11235
Bearer token CRAWL4AI_TOKEN --crawl4ai-token <tok> (none)
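The precedence rule amounts to a nullish-coalescing chain — a sketch with an invented function name, shown here for the base URL:

```typescript
// CLI flag > environment variable > default, resolved fresh per session;
// nothing is written to disk.
function resolveBaseUrl(
  cliUrl: string | undefined,
  env: Record<string, string | undefined> = process.env,
): string {
  return cliUrl ?? env["CRAWL4AI_BASE_URL"] ?? "http://localhost:11235";
}
```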

At runtime you can retarget the extension without restarting pi:

in pi
/crawl4ai-status                         # show current config + /health
/crawl4ai-url http://10.0.0.9:11235      # switch to a LAN server
/crawl4ai-token eyJhbGciOi…              # set JWT for authenticated servers
/crawl4ai-token                          # clear the token
Production note. Crawl4AI supports rate limiting, JWT auth, allow-lists, and a managed config.yml out of the box. Run it behind an internal reverse proxy, point the extension at it with CRAWL4AI_BASE_URL, and you get a single place to audit every URL your fleet of agents fetches. See the Crawl4AI Docker guide for the knobs.

Web access for agents, without the shell.

Disable bash. Install this. Let the model read.