pi-websearch-crawl4ai is a pi extension that gives your coding agent a set of `web_fetch` / `web_crawl` / `web_screenshot` tools, backed by a self-hosted Crawl4AI server. Clean markdown, sanitized HTML, and full-page screenshots — even after you disable pi's bash tool and take curl away.
```
// User: summarize https://docs.crawl4ai.com/core/markdown-generation/

▶ web_fetch(url="https://docs.crawl4ai.com/core/markdown-generation/", filter="fit")
  # Markdown Generation
  Crawl4AI transforms HTML into clean, LLM-friendly Markdown …
  ✓ 200 OK · 4,812 chars · filter=fit

▶ web_crawl(urls=["https://a.example", "https://b.example"],
            crawler_config={"type":"CrawlerRunConfig", "params":{"cache_mode":"bypass"}})
  ## https://a.example  success=true status=200
  ## https://b.example  success=true status=200

▶ bash(command="curl https://example.com")
  ✗ tool bash is disabled in this session — use web_fetch instead

▶ web_screenshot(url="https://example.com")
  ✓ png 118 KB · returned inline to the model
```
The safest pi sessions run with the bash tool turned off — no shell, no curl, no wget, nothing to `rm -rf`. But the moment you do that, your agent loses the ability to read the web. No docs lookups. No changelog peeks. No "grab that URL the user pasted."
pi-websearch-crawl4ai gives the web back, but through a narrow door: a handful of purpose-built tools that talk HTTP to a Crawl4AI server you run. No outbound shell. No arbitrary subprocesses. No third-party SaaS to leak your prompts to. Just `fetch()` to `http://localhost:11235`.
Each tool is a thin, typed wrapper over a Crawl4AI REST endpoint, registered with pi so the model can call it directly. Every tool ships with `promptSnippet` and `promptGuidelines` so the agent knows to prefer `web_fetch` over `bash` + `curl`.
| Tool | Endpoint | Purpose |
|---|---|---|
| `web_fetch` | `POST /md` | URL → clean Markdown. Filters: `fit` (readability), `raw`, `bm25`, `llm`. Supports `query`, `max_chars`. |
| `web_fetch_html` | `POST /html` | Sanitized, preprocessed HTML when you need DOM structure instead of prose. |
| `web_crawl` | `POST /crawl` | Batch-crawl many URLs with typed `BrowserConfig` / `CrawlerRunConfig` passed straight through. |
| `web_execute_js` | `POST /execute_js` | Run JS snippets on a page and read the return values. For content behind dynamic interactions. |
| `web_screenshot` | `POST /screenshot` | Full-page PNG screenshot, returned inline as an image so vision-capable models can look at it. |
| `web_ask` | `GET /ask` | Query Crawl4AI's own library docs — useful only when the model needs to configure `web_crawl` itself. |
Plus three slash commands: `/crawl4ai-status` (pings `/health`), `/crawl4ai-url <url>`, and `/crawl4ai-token <tok>` — for switching servers or toggling JWT without a restart.
The small set of things that matter for an LLM's web I/O:
- Crawl4AI's readability-style `fit` filter strips chrome, ads, and nav; what reaches the model is what a human would read — tokens well spent.
- `filter="bm25"` + a `query` lets the agent pull only the most relevant chunks of a long page, keeping context windows honest.
- `web_crawl` takes up to 100 URLs at once with a shared typed config, returning per-URL markdown, links, metadata, and status.
- `web_execute_js` runs arbitrary snippets against the live DOM (`return document.title`, hrefs, JSON state) for pages plain fetching can't crack.
- `web_screenshot` returns PNG bytes as an `ImageContent` block — vision-capable models see the page exactly as a browser would render it.
- Crawl4AI runs in Docker on `localhost:11235`. Your URLs, prompts, and cookies never leave your network, unlike managed scraping APIs.
- Every tool honors a `max_chars` / `max_chars_per_result` cap. No agent can nuke its context by fetching a 4 MB blog.
- Change the Crawl4AI endpoint or JWT on the fly with `/crawl4ai-url` / `/crawl4ai-token` — no reload, no restart.
- Single-file TypeScript, zero runtime deps. Installs via `pi install`. No daemons, no extra services — just tools the agent can call.
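The `max_chars` cap described above amounts to a guard applied to the tool result before it reaches the model. A minimal sketch, with a hypothetical `capResult` helper (not the extension's actual code):

```typescript
// Hypothetical sketch: truncate a fetched page to a character budget and say so,
// rather than letting a multi-megabyte page flood the agent's context window.
function capResult(markdown: string, maxChars: number): string {
  if (markdown.length <= maxChars) return markdown;
  const kept = markdown.slice(0, maxChars);
  const dropped = markdown.length - maxChars;
  return `${kept}\n\n[truncated: ${dropped} chars over max_chars=${maxChars}]`;
}
```

Marking the truncation explicitly lets the agent know content was dropped and re-fetch with a `query` or a larger cap if it needs more.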
You need two things: a running Crawl4AI server, and the extension loaded into pi. Neither requires npm on your PATH; Docker runs the server, pi runs the extension.
```bash
# one-liner Docker install; exposes http://localhost:11235
docker run -d \
  -p 11235:11235 \
  --name crawl4ai \
  --shm-size=1g \
  unclecode/crawl4ai:latest

# sanity check
curl http://localhost:11235/health
```
```bash
# from npm (recommended)
pi install npm:@codingcoffee/pi-websearch-crawl4ai

# or from git
pi install git:github.com/codingcoffee/pi-websearch-crawl4ai

# try without installing
pi -e npm:@codingcoffee/pi-websearch-crawl4ai
```
Then lock pi down to just the tools you trust, and add `web_fetch` / `web_crawl` to the allowlist — your agent can now read the web without ever seeing a shell:
```bash
pi --tools read,write,edit,web_fetch,web_crawl,web_screenshot

# inside pi
▶ /crawl4ai-status   # confirm the server is up
▶ "Read https://example.com and summarize it."
```
Three sources, in priority order: CLI flag > environment variable > default. All three are per-session; nothing is written to disk.
| Setting | Env var | CLI flag | Default |
|---|---|---|---|
| Base URL | `CRAWL4AI_BASE_URL` | `--crawl4ai-url <url>` | `http://localhost:11235` |
| Bearer token | `CRAWL4AI_TOKEN` | `--crawl4ai-token <tok>` | (none) |
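The flag > env > default precedence reduces to a one-line resolver. A sketch for the base URL; the function name and call shape are illustrative, not the extension's actual implementation:

```typescript
// Hypothetical sketch: CLI flag wins over env var, env var wins over the default.
function resolveBaseUrl(
  cliFlag: string | undefined,             // value of --crawl4ai-url, if passed
  env: Record<string, string | undefined>  // e.g. process.env
): string {
  return cliFlag ?? env["CRAWL4AI_BASE_URL"] ?? "http://localhost:11235";
}
```

Because resolution happens per-session from these three sources, nothing needs to be persisted to disk.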
At runtime you can retarget the extension without restarting pi:
```bash
/crawl4ai-status                      # show current config + /health
/crawl4ai-url http://10.0.0.9:11235   # switch to a LAN server
/crawl4ai-token eyJhbGciOi…           # set JWT for authenticated servers
/crawl4ai-token                       # clear the token
```
Crawl4AI supports JWT auth via its `config.yml` out of the box. Run it behind an internal reverse proxy, point the extension at it with `CRAWL4AI_BASE_URL`, and you get a single place to audit every URL your fleet of agents fetches. See the Crawl4AI Docker guide for the knobs.
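When a token is set, it would travel as a standard Bearer header on every request to the server. A sketch with a hypothetical `authHeaders` helper (assumed, not the extension's actual code):

```typescript
// Hypothetical sketch: attach the JWT from /crawl4ai-token (or CRAWL4AI_TOKEN)
// as a standard Authorization: Bearer header; omit it when no token is set.
function authHeaders(token?: string): Record<string, string> {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (token) headers["Authorization"] = `Bearer ${token}`;
  return headers;
}
```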
Disable bash. Install this. Let the model read.