Crawling shouldn't be that hard

An easy tool for generating LLM-ready markdown files.

o3 has powerful search – it can even keep searching the web for minutes until it finds the precise location where a picture was shot. Maybe my opinion on search is already outdated...

Sometimes all I want is to download the documentation of a website and upload it to Claude or ChatGPT project documents. When I'm about to work on something for a few days where I know the required knowledge is newer than the LLMs' cutoff, I don't want to ask the LLM to do a web search at the start of every chat to re-load all the context.

I tried some online solutions. FireCrawl seems like the coolest one, but crawling from its playground sometimes returns only a single page, and I'm definitely not paying monthly for crawling credits. On top of that, the new "FIRE-1" agent is quite expensive: as of today, it's 150 credits per page plus 0-900 more based on complexity, and 1,000 credits cost $9. So... a few dollars per request? Really?
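To sanity-check that "a few dollars per request" claim, here's the back-of-the-envelope arithmetic from the pricing quoted above (150 base credits plus 0-900 complexity credits, at $9 per 1,000 credits):

```python
# Rough FIRE-1 cost per page, from the pricing quoted above.
USD_PER_CREDIT = 9 / 1000  # $9 buys 1,000 credits

def page_cost(complexity_credits: int) -> float:
    """Cost in USD for one page: 150 base credits + 0-900 complexity credits."""
    return (150 + complexity_credits) * USD_PER_CREDIT

print(f"${page_cost(0):.2f} - ${page_cost(900):.2f} per page")
# → $1.35 - $9.45 per page
```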

So, I wrote my own quick solution with crawl4ai. Here's what I needed:

  • Crawl a website or an llms.txt page to some depth.
  • Extract decent markdown (with some header/footer/nav cleanup). Models are good; I don't need perfection.
  • Merge everything into a few files (a Claude project has a limit of 20 files).
  • Save to disk so that I can just drag and drop the files into the Claude/ChatGPT UI.
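The merge step is the only mildly interesting part. A minimal sketch of the idea (not the actual url2llm code; `merge_pages` is a hypothetical helper): pack the per-page markdown into at most 20 merged documents so everything fits in a Claude project.

```python
MAX_FILES = 20  # Claude project upload limit

def merge_pages(pages: list[str], max_files: int = MAX_FILES) -> list[str]:
    """Distribute crawled pages' markdown across at most `max_files` documents."""
    if not pages:
        return []
    per_file = -(-len(pages) // max_files)  # ceiling division
    return [
        "\n\n---\n\n".join(pages[i : i + per_file])
        for i in range(0, len(pages), per_file)
    ]

chunks = merge_pages([f"# Page {n}" for n in range(45)])
print(len(chunks))  # → 15
```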

It turned out to be a quick project – thanks in part to LLMs, of course. Would I have started working on it if the first raw version had taken more than a single sitting? Probably not.

Here's the link on GitHub: url2llm

Now, when I need a site crawled, I just run a command like this:

uv run \
   --with url2llm \
   url2llm \
   --depth 2 \
   --url "https://modelcontextprotocol.io/" \
   --instruction "I need documents related to developing MCP (model context protocol) servers" \
   --provider "gemini/gemini-2.5-flash-preview-04-17" \
   --api_key ${GEMINI_API_KEY} \
   --output-dir ~/Desktop/crawl_out/

And I've got the clean file model-context-protocol-documentation.md ready to go!