ragcrawl

Recursive website crawler producing LLM-ready knowledge base artifacts


ragcrawl is a Python library for crawling websites and producing clean, structured content optimized for Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems.

  • Quick Start: Get up and running in minutes with the quickstart guide.

  • User Guide: Learn how to crawl websites, sync updates, and export content.

  • Configuration: Customize crawler behavior, storage, and output formats.

  • API Reference: Complete Python API documentation with examples.

Features

Clean Markdown Output

Convert web pages to clean, readable Markdown while preserving semantic structure like headings, code blocks, and lists.

Python
from ragcrawl import CrawlJob, CrawlerConfig

config = CrawlerConfig(
    seeds=["https://docs.example.com"],
    max_pages=100,
)

job = CrawlJob(config)
result = await job.run()

for doc in result.documents:
    print(f"# {doc.title}\n{doc.markdown[:200]}...")
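For intuition, the HTML-to-Markdown idea can be pictured with a tiny standalone converter. This is only an illustration built on Python's `html.parser`; ragcrawl's actual extractor is not shown here and handles far more of HTML:

```python
from html.parser import HTMLParser

class TinyMarkdowner(HTMLParser):
    """Illustrative HTML-to-Markdown pass: headings, paragraphs, inline code."""
    def __init__(self):
        super().__init__()
        self.out = []      # emitted markdown fragments
        self.prefix = ""   # pending heading marker, e.g. "## "

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.prefix = "#" * int(tag[1]) + " "
        elif tag == "code":
            self.out.append("`")

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "p"):
            self.out.append("\n\n")   # block boundary
        elif tag == "code":
            self.out.append("`")

    def handle_data(self, data):
        if not data.strip():
            return                    # skip inter-tag whitespace
        self.out.append(self.prefix + data)
        self.prefix = ""

def html_to_md(html: str) -> str:
    parser = TinyMarkdowner()
    parser.feed(html)
    return "".join(parser.out).strip()

# html_to_md("<h1>Title</h1><p>Body with <code>x</code>.</p>")
# → "# Title\n\nBody with `x`."
```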

Incremental Sync

Efficiently update your knowledge base with only changed content using sitemap detection, ETags, and content hashing.

Python
from ragcrawl import SyncJob, SyncConfig

config = SyncConfig(
    site_id="site_abc123",
    use_sitemap=True,
)

job = SyncJob(config)
result = await job.run()

print(f"New: {result.stats.pages_new}")
print(f"Updated: {result.stats.pages_changed}")
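The content-hashing half of change detection is easy to see in isolation. A minimal sketch, assuming whitespace-insensitive normalization (ragcrawl's actual normalization and hashing scheme may differ):

```python
import hashlib

def content_fingerprint(markdown: str) -> str:
    """Hash of normalized page content; stable across insignificant whitespace."""
    normalized = " ".join(markdown.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# A page is re-stored only when its fingerprint changes:
old = content_fingerprint("# Docs\n\nHello  world")
new = content_fingerprint("# Docs\nHello world")  # same words, new whitespace
assert old == new  # no re-index needed
```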

RAG-Ready Chunking

Built-in chunking strategies optimized for embedding models with heading-aware and token-based options.

Python
from ragcrawl.chunking import HeadingChunker

chunker = HeadingChunker(max_tokens=500)
chunks = chunker.chunk_documents(result.documents)

for chunk in chunks:
    # Ready for embedding API
    print(f"Section: {chunk.section_path}")
    print(f"Tokens: {chunk.token_estimate}")
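To make the heading-aware strategy concrete, here is a self-contained sketch that splits at Markdown headings and then enforces a rough word-count budget. It approximates one token per word purely for illustration; ragcrawl's chunkers and token estimates are not reproduced here:

```python
def chunk_by_headings(markdown: str, max_tokens: int = 500) -> list[dict]:
    """Split markdown into heading-scoped sections, then cap each chunk
    at a rough token budget (~1 token per whitespace-separated word)."""
    sections, current, path = [], [], "root"
    for line in markdown.splitlines():
        if line.startswith("#"):
            if current:
                sections.append((path, current))
            path = line.lstrip("# ").strip()   # section title becomes the path
            current = []
        else:
            current.append(line)
    if current:
        sections.append((path, current))

    chunks = []
    for path, lines in sections:
        words = " ".join(lines).split()
        for i in range(0, max(len(words), 1), max_tokens):
            text = " ".join(words[i:i + max_tokens])
            if text:
                chunks.append({"section_path": path, "text": text,
                               "token_estimate": len(text.split())})
    return chunks

# chunk_by_headings("# A\none two\n# B\nthree")
# → two chunks, section paths "A" and "B"
```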

Flexible Storage

Choose between DuckDB for local development or DynamoDB for cloud deployments.

Python
# DuckDB backend (local development)
from ragcrawl.config import StorageConfig, DuckDBConfig

config = StorageConfig(
    backend=DuckDBConfig(path="./crawler.duckdb")
)

Python
# DynamoDB backend (cloud deployments)
from ragcrawl.config import StorageConfig, DynamoDBConfig

config = StorageConfig(
    backend=DynamoDBConfig(
        table_prefix="ragcrawl_",
        region="us-west-2",
    )
)

Installation

Bash
# Basic installation
pip install ragcrawl

# With browser rendering
pip install ragcrawl[browser]

# With DynamoDB support
pip install ragcrawl[dynamodb]

# Full installation
pip install ragcrawl[all]

Bash
# Alternatively, with uv
uv pip install ragcrawl

CLI Quick Start

Bash
# Crawl a documentation site
ragcrawl crawl https://docs.example.com --max-pages 100

# Sync for updates
ragcrawl sync site_abc123

# List crawled sites
ragcrawl sites

# View crawl history
ragcrawl runs site_abc123

Use Cases

Use Case            Description
Documentation RAG   Build Q&A systems from technical docs
Knowledge Base      Create searchable internal wikis
Content Migration   Extract structured content from websites
Research            Collect and analyze web content

Architecture

Text Only
┌───────────────┐     ┌───────────┐     ┌──────────────┐
│    Fetcher    │────▶│ Extractor │────▶│   Storage    │
│(HTTP/Browser) │     │ (HTML→MD) │     │ (DuckDB/Dyn) │
└───────────────┘     └───────────┘     └──────────────┘
        │                                      │
        ▼                                      ▼
┌───────────────┐     ┌───────────┐     ┌──────────────┐
│   Frontier    │     │  Chunker  │◀────│    Export    │
│     Queue     │     │ (Heading/ │     │ (JSON/JSONL) │
└───────────────┘     │   Token)  │     └──────────────┘
                      └───────────┘
                      ┌───────────┐
                      │ Publisher │
                      │ (Single/  │
                      │  Multi)   │
                      └───────────┘
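The Fetcher/Frontier loop at the top of the diagram is, at its core, a breadth-first traversal with deduplication. A standalone sketch over an in-memory link graph (no networking, purely illustrative of the queueing discipline, not ragcrawl's internals):

```python
from collections import deque

def bfs_frontier(seed: str, links: dict[str, list[str]], max_pages: int) -> list[str]:
    """Breadth-first crawl order: the frontier queue holds discovered-but-
    unfetched URLs, while `seen` prevents re-enqueueing duplicates."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()           # "fetch" the next page
        order.append(url)
        for link in links.get(url, []):    # "extract" its outgoing links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

# bfs_frontier("/", {"/": ["/a", "/b"], "/a": ["/b", "/c"]}, 10)
# → ["/", "/a", "/b", "/c"]
```

The `max_pages` bound mirrors the `max_pages` crawl limit shown in the configuration examples above: the crawl stops once the budget is spent, regardless of how many URLs remain queued.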

Community

We welcome contributions from the community!

License

ragcrawl is licensed under the Apache License 2.0. See the LICENSE file for details.