# Getting Started
Welcome to ragcrawl! This section will help you get up and running quickly.
## What is ragcrawl?
ragcrawl is a Python library for recursively crawling websites and producing LLM-ready knowledge base artifacts. It's designed specifically for:
- RAG Systems: Generate clean, structured content for Retrieval-Augmented Generation
- Documentation Crawling: Convert technical docs into searchable knowledge bases
- Content Migration: Extract and structure content from existing websites
- Knowledge Management: Build internal knowledge bases from company resources
## Key Features
- **Smart Crawling**: Respects robots.txt, handles rate limiting, and prevents duplicate content with intelligent URL normalization.
- **Clean Markdown**: Converts web pages to clean, readable Markdown while preserving semantic structure.
- **Incremental Sync**: Efficiently updates your knowledge base with only changed content, using sitemaps, ETags, and content hashing.
- **Flexible Chunking**: Heading-aware and token-based chunking optimized for embedding models.
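To make the heading-aware chunking idea concrete, here is a minimal standard-library sketch of the technique. This is an illustration only, not ragcrawl's actual API; the `chunk_by_headings` function name and chunk shape are assumptions for the example:

```python
import re

def chunk_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per level-1 or level-2 heading.

    Illustrative sketch of heading-aware chunking -- not ragcrawl's real API.
    """
    chunks = []
    current = {"heading": None, "lines": []}
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,2})\s+(.*)", line)
        if match:
            # A new heading starts a new chunk; flush the previous one.
            if current["heading"] or current["lines"]:
                chunks.append(current)
            current = {"heading": match.group(2), "lines": []}
        else:
            current["lines"].append(line)
    chunks.append(current)
    return [
        {"heading": c["heading"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]

doc = "# Intro\nWelcome.\n\n## Install\npip install ragcrawl\n"
for chunk in chunk_by_headings(doc):
    print(chunk["heading"], "->", repr(chunk["text"]))
```

A production chunker would additionally cap each chunk by token count against the target embedding model's context window, which is what the token-based mode above refers to.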
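The duplicate prevention mentioned under Smart Crawling rests on URL normalization: rewriting trivially different spellings of the same URL into one canonical form before comparing them. A minimal standard-library sketch of the idea (an assumption-laden illustration, not ragcrawl's implementation):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Normalize a URL so equivalent spellings compare equal.

    Illustrative sketch, not ragcrawl's implementation: lowercases the
    scheme and host, drops fragments and default ports (80/443), and
    strips a trailing slash from the path.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Keep only non-default ports; 80 (http) and 443 (https) are dropped.
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"
    # Fragments never reach the server, so they are discarded entirely.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

# Both spellings collapse to "http://example.com/docs":
normalize_url("HTTP://Example.com:80/docs/")
normalize_url("http://example.com/docs#intro")
```

Crawlers typically apply a normalization like this before checking a visited-set, so each page is fetched once regardless of how it was linked.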
## Quick Links
| Topic | Description |
|---|---|
| Installation | Install ragcrawl and its dependencies |
| Quickstart | Start crawling in 5 minutes |
| CLI Reference | Command-line interface guide |
| Configuration | Customize crawler behavior |
## System Requirements
- Python: 3.10 or higher
- OS: Linux, macOS, or Windows
- Memory: 512 MB minimum (more for large sites)
- Storage: Varies based on crawled content
## Next Steps
- Install ragcrawl - Get the library installed
- Follow the Quickstart - Crawl your first site
- Explore the User Guide - Learn advanced features