ragcrawl¶
Recursive website crawler producing LLM-ready knowledge base artifacts
ragcrawl is a Python library for crawling websites and producing clean, structured content optimized for Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems.
- Quick Start: Get up and running in minutes with our quickstart guide.
- User Guide: Learn how to crawl websites, sync updates, and export content.
- Configuration: Customize crawler behavior, storage, and output formats.
- API Reference: Complete Python API documentation with examples.
Features¶
Clean Markdown Output¶
Convert web pages to clean, readable Markdown while preserving semantic structure like headings, code blocks, and lists.
```python
from ragcrawl import CrawlJob, CrawlerConfig

config = CrawlerConfig(
    seeds=["https://docs.example.com"],
    max_pages=100,
)

job = CrawlJob(config)
result = await job.run()

for doc in result.documents:
    print(f"# {doc.title}\n{doc.markdown[:200]}...")
```
Incremental Sync¶
Efficiently update your knowledge base with only changed content using sitemap detection, ETags, and content hashing.
```python
from ragcrawl import SyncJob, SyncConfig

config = SyncConfig(
    site_id="site_abc123",
    use_sitemap=True,
)

job = SyncJob(config)
result = await job.run()

print(f"New: {result.stats.pages_new}")
print(f"Updated: {result.stats.pages_changed}")
```
RAG-Ready Chunking¶
Built-in chunking strategies optimized for embedding models with heading-aware and token-based options.
```python
from ragcrawl.chunking import HeadingChunker

chunker = HeadingChunker(max_tokens=500)
chunks = chunker.chunk_documents(result.documents)

for chunk in chunks:
    # Ready for embedding API
    print(f"Section: {chunk.section_path}")
    print(f"Tokens: {chunk.token_estimate}")
```
Flexible Storage¶
Choose between DuckDB for local development and DynamoDB for cloud deployments.
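As a rough sketch of what backend selection might look like, the `storage`, `storage_path`, and `storage_table` parameters below are assumptions for illustration, not documented `CrawlerConfig` options:

```python
# Hypothetical sketch: the storage-related parameters are assumed,
# not part of the documented ragcrawl API.
from ragcrawl import CrawlerConfig

# Local development: a DuckDB database file on disk
local_config = CrawlerConfig(
    seeds=["https://docs.example.com"],
    storage="duckdb",          # assumed backend selector
    storage_path="crawl.db",   # assumed local database path
)

# Cloud deployment: a DynamoDB table
cloud_config = CrawlerConfig(
    seeds=["https://docs.example.com"],
    storage="dynamodb",        # assumed backend selector
    storage_table="ragcrawl",  # assumed table name
)
```

See the Configuration guide for the actual storage options.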
Installation¶
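Assuming the package is published on PyPI under the name `ragcrawl`, installation would follow the usual pattern:

```shell
# Assumes the package is available on PyPI as "ragcrawl"
pip install ragcrawl
```

See the detailed installation instructions below for other options.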
CLI Quick Start¶
```shell
# Crawl a documentation site
ragcrawl crawl https://docs.example.com --max-pages 100

# Sync for updates
ragcrawl sync site_abc123

# List crawled sites
ragcrawl sites

# View crawl history
ragcrawl runs site_abc123
```
Use Cases¶
| Use Case | Description |
|---|---|
| Documentation RAG | Build Q&A systems from technical docs |
| Knowledge Base | Create searchable internal wikis |
| Content Migration | Extract structured content from websites |
| Research | Collect and analyze web content |
Architecture¶
```
┌────────────────┐     ┌───────────┐     ┌─────────────┐
│    Fetcher     │────▶│ Extractor │────▶│   Storage   │
│ (HTTP/Browser) │     │ (HTML→MD) │     │ (DuckDB/Dyn)│
└────────────────┘     └───────────┘     └─────────────┘
        │                                       │
        ▼                                       ▼
┌────────────────┐     ┌───────────┐     ┌─────────────┐
│    Frontier    │     │  Chunker  │◀────│   Export    │
│     Queue      │     │ (Heading/ │     │ (JSON/JSONL)│
└────────────────┘     │  Token)   │     └─────────────┘
                       └───────────┘
                             │
                             ▼
                       ┌───────────┐
                       │ Publisher │
                       │ (Single/  │
                       │  Multi)   │
                       └───────────┘
```
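The fetcher/frontier loop at the heart of the diagram can be sketched conceptually as a breadth-first crawl. All names below are illustrative, not part of the ragcrawl API, and the in-memory "site" stands in for real HTTP fetching and HTML-to-Markdown extraction:

```python
# Conceptual sketch of the Fetcher + Frontier Queue stages; the class
# and function names here are illustrative, not the ragcrawl API.
from collections import deque
from dataclasses import dataclass


@dataclass
class Document:
    url: str
    markdown: str


def crawl(seed: str, pages: dict) -> list:
    """Breadth-first crawl over an in-memory 'site'."""
    frontier = deque([seed])          # the frontier queue
    seen = {seed}
    docs = []
    while frontier:
        url = frontier.popleft()
        markdown, links = pages[url]  # stand-in for fetch + HTML->MD extraction
        docs.append(Document(url, markdown))
        for link in links:            # discovered links feed the frontier
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return docs


# A tiny in-memory site: url -> (markdown, outgoing links)
site = {
    "/": ("# Home", ["/a", "/b"]),
    "/a": ("# Page A", ["/b"]),
    "/b": ("# Page B", []),
}
docs = crawl("/", site)
print([d.url for d in docs])  # → ['/', '/a', '/b']
```

The real pipeline adds politeness, deduplication, and change detection on top of this basic loop, then hands documents to the chunker and exporter.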
Next Steps¶
- Installation: Detailed installation instructions
- Quick Start: Start crawling in 5 minutes
- CLI Reference: Command-line interface guide
- GitHub: Report issues and contribute
Community¶
We welcome contributions from the community! Here's how you can get involved:
- Contributing: Learn how to contribute code, documentation, and more
- Code of Conduct: Our community standards and expectations
- Support: Get help and report issues
- Changelog: Release history and updates
License¶
ragcrawl is licensed under the Apache License 2.0. See the LICENSE file for details.