Crawling shouldn't be that hard
24 Apr 2025
An easy tool for generating LLM-ready markdown files.
o3 has powerful search – it will even keep searching the web for minutes until it finds the precise location where a picture was shot. Maybe my Opinion Search is already outdated...
Sometimes all I want is to download a website's documentation and upload it to a Claude or ChatGPT project. When I'm about to work on something for a few days and I know the required knowledge is newer than the LLM's cutoff, I don't want to ask the model to do a web search at the start of every chat just to re-load all the context.
I tried some online solutions. FireCrawl seems like the coolest one, but crawling from its playground sometimes returns only a single page, and I'm definitely not paying monthly for crawling credits. On top of that, the new "FIRE-1" agent is quite expensive: as of today, it's 150 credits per page plus 0–900 more based on complexity, and 1,000 credits cost $9. So... a few dollars per request? Really?
So, I wrote my own quick solution with crawl4ai. Here's what I needed:
- Crawl a website or an llms.txt page to some depth.
- Extract decent markdown (with some header/footer/nav cleanup). Models are good; I don't need perfection.
- Merge everything into a few files (a Claude project has a limit of 20 files).
- Save to disk so that I can just drag and drop the files into Claude/ChatGPT UI.
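That's small enough to sketch in a screenful. This isn't the actual url2llm code, just the shape of the idea: it leans on crawl4ai's documented `AsyncWebCrawler.arun()` entry point (the result's `.markdown` and `.success` fields are its basic API), while the merge helper and the `docs-NN.md` naming are my own illustration.

```python
import asyncio
from pathlib import Path

from crawl4ai import AsyncWebCrawler  # pip install crawl4ai

async def fetch_markdown(urls: list[str]) -> list[str]:
    # arun() renders a page and returns a result whose .markdown field
    # holds the converted text -- crawl4ai's basic entry point.
    async with AsyncWebCrawler() as crawler:
        results = [await crawler.arun(url=u) for u in urls]
    return [str(r.markdown) for r in results if r.success]

def merge_into_files(pages: list[str], out_dir: Path, max_files: int = 20) -> None:
    # Pack the crawled pages into at most max_files markdown files,
    # since Claude projects won't take more than 20 uploads.
    if not pages:
        return
    out_dir.mkdir(parents=True, exist_ok=True)
    per_file = -(-len(pages) // max_files)  # ceiling division
    for i in range(0, len(pages), per_file):
        chunk = "\n\n---\n\n".join(pages[i : i + per_file])
        (out_dir / f"docs-{i // per_file:02d}.md").write_text(chunk)

pages = asyncio.run(fetch_markdown(["https://modelcontextprotocol.io/"]))
merge_into_files(pages, Path("crawl_out"))
```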
It turned out to be a quick project – thanks in part to LLMs, of course. Would I have started working on it if the first rough version had taken more than a single sitting? Probably not.
Here it is on GitHub: url2llm
Now, when I need a site crawled, I just run a command like this:
```bash
uv run \
  --with url2llm \
  url2llm \
  --depth 2 \
  --url "https://modelcontextprotocol.io/" \
  --instruction "I need documents related to developing MCP (model context protocol) servers" \
  --provider "gemini/gemini-2.5-flash-preview-04-17" \
  --api_key ${GEMINI_API_KEY} \
  --output-dir ~/Desktop/crawl_out/
```
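(The nice part of `uv run --with url2llm` is that nothing gets installed permanently: uv resolves the package into a throwaway environment for just that one invocation.)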
And I've got the clean file `model-context-protocol-documentation.md` ready to go!