ragcrawl

Recursive website crawler producing LLM-ready knowledge base artifacts


ragcrawl is a Python library for crawling websites and producing clean, structured content optimized for Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems.

  • Quick Start: Get up and running in minutes with the quickstart guide.

  • User Guide: Learn how to crawl websites, sync updates, and export content.

  • Configuration: Customize crawler behavior, storage, and output formats.

  • API Reference: Complete Python API documentation with examples.

Features

Clean Markdown Output

Convert web pages to clean, readable Markdown while preserving semantic structure like headings, code blocks, and lists.

Python
from ragcrawl import CrawlJob, CrawlerConfig

config = CrawlerConfig(
    seeds=["https://docs.example.com"],
    max_pages=100,
)

job = CrawlJob(config)
result = await job.run()

for doc in result.documents:
    print(f"# {doc.title}\n{doc.markdown[:200]}...")
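For intuition, the HTML-to-Markdown idea can be pictured with a tiny standalone converter. This is only an illustration built on Python's `html.parser`; ragcrawl's actual extractor is not shown here and handles far more of HTML:

```python
from html.parser import HTMLParser

class TinyMarkdowner(HTMLParser):
    """Illustrative HTML-to-Markdown pass: headings, paragraphs, inline code."""
    def __init__(self):
        super().__init__()
        self.out = []      # emitted markdown fragments
        self.prefix = ""   # pending heading marker, e.g. "## "

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.prefix = "#" * int(tag[1]) + " "
        elif tag == "code":
            self.out.append("`")

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "p"):
            self.out.append("\n\n")   # block boundary
        elif tag == "code":
            self.out.append("`")

    def handle_data(self, data):
        if not data.strip():
            return                    # skip inter-tag whitespace
        self.out.append(self.prefix + data)
        self.prefix = ""

def html_to_md(html: str) -> str:
    parser = TinyMarkdowner()
    parser.feed(html)
    return "".join(parser.out).strip()

# html_to_md("<h1>Title</h1><p>Body with <code>x</code>.</p>")
# → "# Title\n\nBody with `x`."
```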

Incremental Sync

Efficiently update your knowledge base with only changed content using sitemap detection, ETags, and content hashing.

Python
from ragcrawl import SyncJob, SyncConfig

config = SyncConfig(
    site_id="site_abc123",
    use_sitemap=True,
)

job = SyncJob(config)
result = await job.run()

print(f"New: {result.stats.pages_new}")
print(f"Updated: {result.stats.pages_changed}")
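The content-hashing half of change detection is easy to see in isolation. A minimal sketch, assuming whitespace-insensitive normalization (ragcrawl's actual normalization and hashing scheme may differ):

```python
import hashlib

def content_fingerprint(markdown: str) -> str:
    """Hash of normalized page content; stable across insignificant whitespace."""
    normalized = " ".join(markdown.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# A page is re-stored only when its fingerprint changes:
old = content_fingerprint("# Docs\n\nHello  world")
new = content_fingerprint("# Docs\nHello world")  # same words, new whitespace
assert old == new  # no re-index needed
```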

RAG-Ready Chunking

Built-in chunking strategies optimized for embedding models with heading-aware and token-based options.

Python
from ragcrawl.chunking import HeadingChunker

chunker = HeadingChunker(max_tokens=500)
chunks = chunker.chunk_documents(result.documents)

for chunk in chunks:
    # Ready for embedding API
    print(f"Section: {chunk.section_path}")
    print(f"Tokens: {chunk.token_estimate}")
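To make the heading-aware strategy concrete, here is a self-contained sketch that splits at Markdown headings and then enforces a rough word-count budget. It approximates one token per word purely for illustration; ragcrawl's chunkers and token estimates are not reproduced here:

```python
def chunk_by_headings(markdown: str, max_tokens: int = 500) -> list[dict]:
    """Split markdown into heading-scoped sections, then cap each chunk
    at a rough token budget (~1 token per whitespace-separated word)."""
    sections, current, path = [], [], "root"
    for line in markdown.splitlines():
        if line.startswith("#"):
            if current:
                sections.append((path, current))
            path = line.lstrip("# ").strip()   # section title becomes the path
            current = []
        else:
            current.append(line)
    if current:
        sections.append((path, current))

    chunks = []
    for path, lines in sections:
        words = " ".join(lines).split()
        for i in range(0, max(len(words), 1), max_tokens):
            text = " ".join(words[i:i + max_tokens])
            if text:
                chunks.append({"section_path": path, "text": text,
                               "token_estimate": len(text.split())})
    return chunks

# chunk_by_headings("# A\none two\n# B\nthree")
# → two chunks, section paths "A" and "B"
```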

Flexible Storage

Choose between DuckDB for local development or DynamoDB for cloud deployments.

Python
# DuckDB backend (local development)
from ragcrawl.config import StorageConfig, DuckDBConfig

config = StorageConfig(
    backend=DuckDBConfig(path="./crawler.duckdb")
)

Python
# DynamoDB backend (cloud deployments)
from ragcrawl.config import StorageConfig, DynamoDBConfig

config = StorageConfig(
    backend=DynamoDBConfig(
        table_prefix="ragcrawl_",
        region="us-west-2",
    )
)

Installation

Bash
# Basic installation
pip install ragcrawl

# With browser rendering
pip install ragcrawl[browser]

# With DynamoDB support
pip install ragcrawl[dynamodb]

# Full installation
pip install ragcrawl[all]

Bash
# Alternatively, with uv
uv pip install ragcrawl

CLI Quick Start

Bash
# Crawl a documentation site
ragcrawl crawl https://docs.example.com --max-pages 100

# Sync for updates
ragcrawl sync site_abc123

# List crawled sites
ragcrawl sites

# View crawl history
ragcrawl runs site_abc123

Use Cases

Use Case            Description
Documentation RAG   Build Q&A systems from technical docs
Knowledge Base      Create searchable internal wikis
Content Migration   Extract structured content from websites
Research            Collect and analyze web content

Architecture

Text Only
┌───────────────┐     ┌───────────┐     ┌──────────────┐
│    Fetcher    │────▶│ Extractor │────▶│   Storage    │
│(HTTP/Browser) │     │ (HTML→MD) │     │ (DuckDB/Dyn) │
└───────────────┘     └───────────┘     └──────────────┘
        │                                      │
        ▼                                      ▼
┌───────────────┐     ┌───────────┐     ┌──────────────┐
│   Frontier    │     │  Chunker  │◀────│    Export    │
│     Queue     │     │ (Heading/ │     │ (JSON/JSONL) │
└───────────────┘     │   Token)  │     └──────────────┘
                      └───────────┘
                      ┌───────────┐
                      │ Publisher │
                      │ (Single/  │
                      │  Multi)   │
                      └───────────┘
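The Fetcher/Frontier loop at the top of the diagram is, at its core, a breadth-first traversal with deduplication. A standalone sketch over an in-memory link graph (no networking, purely illustrative of the queueing discipline, not ragcrawl's internals):

```python
from collections import deque

def bfs_frontier(seed: str, links: dict[str, list[str]], max_pages: int) -> list[str]:
    """Breadth-first crawl order: the frontier queue holds discovered-but-
    unfetched URLs, while `seen` prevents re-enqueueing duplicates."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()           # "fetch" the next page
        order.append(url)
        for link in links.get(url, []):    # "extract" its outgoing links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

# bfs_frontier("/", {"/": ["/a", "/b"], "/a": ["/b", "/c"]}, 10)
# → ["/", "/a", "/b", "/c"]
```

The `max_pages` bound mirrors the `max_pages` crawl limit shown in the configuration examples above: the crawl stops once the budget is spent, regardless of how many URLs remain queued.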

Community

We welcome contributions from the community!

License

ragcrawl is licensed under the Apache License 2.0. See the LICENSE file for details.