# Getting Started
Welcome to ragcrawl! This section will help you get up and running quickly.
## What is ragcrawl?
ragcrawl is a Python library for recursively crawling websites and producing LLM-ready knowledge base artifacts. It's designed specifically for:
- RAG Systems: Generate clean, structured content for Retrieval-Augmented Generation
- Documentation Crawling: Convert technical docs into searchable knowledge bases
- Content Migration: Extract and structure content from existing websites
- Knowledge Management: Build internal knowledge bases from company resources
## Key Features
- **Smart Crawling**: Respects robots.txt, handles rate limiting, and prevents duplicate content with intelligent URL normalization.
- **Clean Markdown**: Converts web pages to clean, readable Markdown while preserving semantic structure.
- **Incremental Sync**: Efficiently updates your knowledge base with only changed content, using sitemaps, ETags, and content hashing.
- **Flexible Chunking**: Heading-aware and token-based chunking optimized for embedding models.
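To make the heading-aware chunking idea concrete, here is a minimal standard-library sketch of the technique. This is an illustration only, not ragcrawl's actual API; the `chunk_by_headings` function name and chunk shape are assumptions for the example:

```python
import re

def chunk_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per level-1 or level-2 heading.

    Illustrative sketch of heading-aware chunking -- not ragcrawl's real API.
    """
    chunks = []
    current = {"heading": None, "lines": []}
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,2})\s+(.*)", line)
        if match:
            # A new heading starts a new chunk; flush the previous one.
            if current["heading"] or current["lines"]:
                chunks.append(current)
            current = {"heading": match.group(2), "lines": []}
        else:
            current["lines"].append(line)
    chunks.append(current)
    return [
        {"heading": c["heading"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]

doc = "# Intro\nWelcome.\n\n## Install\npip install ragcrawl\n"
for chunk in chunk_by_headings(doc):
    print(chunk["heading"], "->", repr(chunk["text"]))
```

A production chunker would additionally cap each chunk by token count against the target embedding model's context window, which is what the token-based mode above refers to.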
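The duplicate prevention mentioned under Smart Crawling rests on URL normalization: rewriting trivially different spellings of the same URL into one canonical form before comparing them. A minimal standard-library sketch of the idea (an assumption-laden illustration, not ragcrawl's implementation):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Normalize a URL so equivalent spellings compare equal.

    Illustrative sketch, not ragcrawl's implementation: lowercases the
    scheme and host, drops fragments and default ports (80/443), and
    strips a trailing slash from the path.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Keep only non-default ports; 80 (http) and 443 (https) are dropped.
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"
    # Fragments never reach the server, so they are discarded entirely.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

# Both spellings collapse to "http://example.com/docs":
normalize_url("HTTP://Example.com:80/docs/")
normalize_url("http://example.com/docs#intro")
```

Crawlers typically apply a normalization like this before checking a visited-set, so each page is fetched once regardless of how it was linked.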
## Quick Links
| Topic | Description |
|---|---|
| Installation | Install ragcrawl and its dependencies |
| Quickstart | Start crawling in 5 minutes |
| CLI Reference | Command-line interface guide |
| Configuration | Customize crawler behavior |
## System Requirements
- Python: 3.10 or higher
- OS: Linux, macOS, or Windows
- Memory: 512 MB minimum (more for large sites)
- Storage: Varies based on crawled content
## Next Steps
- Install ragcrawl - Get the library installed
- Follow the Quickstart - Crawl your first site
- Explore the User Guide - Learn advanced features