# User Guide
This guide covers the core functionality of ragcrawl in detail.
## Overview

ragcrawl provides a complete pipeline for converting websites into LLM-ready knowledge bases:

```mermaid
graph LR
    A[Web Pages] --> B[Fetcher]
    B --> C[Extractor]
    C --> D[Storage]
    D --> E[Chunker]
    E --> F[Exporter]
    F --> G[Output Files]
```
## Guide Contents

### Crawling Websites

Learn how to crawl websites effectively (a configuration sketch follows the list):
- Starting a basic crawl
- Configuring URL filters
- Handling JavaScript-rendered content
- Respecting robots.txt and rate limits
- Managing large-scale crawls
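
A minimal sketch of a filtered test crawl, reusing the `CrawlerConfig` options shown later in this guide. The commented-out `exclude_patterns` and `request_delay_seconds` options are illustrative assumptions, not confirmed parameters:

```python
import asyncio

from ragcrawl import CrawlJob, CrawlerConfig

config = CrawlerConfig(
    seeds=["https://docs.example.com"],
    max_pages=50,                  # keep the first run small while testing filters
    include_patterns=["/docs/*"],  # only follow documentation URLs
    # exclude_patterns=["/docs/archive/*"],  # hypothetical option
    # request_delay_seconds=1.0,             # hypothetical rate-limit option
)

async def main() -> None:
    job = CrawlJob(config)
    result = await job.run()
    print(f"Crawled {len(result.documents)} pages")

asyncio.run(main())
```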
### Incremental Sync

Keep your knowledge base up to date (a sketch of the ETag mechanism follows the list):
- Understanding sync strategies
- Using sitemaps for efficient updates
- Conditional requests with ETags
- Content change detection
- Handling deleted pages
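
To make the conditional-request strategy concrete, here is the underlying HTTP mechanism that `use_conditional_requests` relies on, sketched with `httpx` rather than ragcrawl's internals:

```python
import httpx

url = "https://docs.example.com/guide"

# First fetch: remember the validator the server sends back.
first = httpx.get(url)
etag = first.headers.get("ETag")

# Later fetch: send If-None-Match; an unchanged page answers 304 with no body.
if etag is not None:
    later = httpx.get(url, headers={"If-None-Match": etag})
    if later.status_code == 304:
        print("Unchanged: skip re-extraction")
    else:
        print("Changed: re-process the page")
```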
### Chunking Content

Prepare content for embedding models (a chunk-size sketch follows the list):
- Heading-aware chunking
- Token-based chunking
- Configuring chunk sizes
- Preserving context in chunks
- Metadata in chunks
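
As a rough illustration of token-based sizing, you can cross-check chunk lengths with an external tokenizer. `HeadingChunker` and `max_tokens` appear elsewhere in this guide; the `chunk.text` attribute is an assumption about the chunk objects:

```python
import tiktoken

from ragcrawl.chunking import HeadingChunker

chunker = HeadingChunker(max_tokens=500)
chunks = chunker.chunk_documents(result.documents)  # `result` from a prior crawl

# Cross-check sizes with a tokenizer; `chunk.text` is an assumed attribute name.
enc = tiktoken.get_encoding("cl100k_base")
sizes = [len(enc.encode(chunk.text)) for chunk in chunks]
print(f"Largest chunk: {max(sizes)} tokens")
```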
### Exporting Data

Export your crawled data (a minimal JSONL sketch follows the list):
- JSON and JSONL formats
- Single-page combined output
- Multi-page with preserved structure
- Link rewriting for local files
- Custom export formats
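
For orientation, a JSONL export is simply one JSON object per line, which a hand-rolled writer makes clear. The `text` and `metadata` attributes here are assumptions about the chunk objects, not ragcrawl's confirmed schema:

```python
import json

# One record per line; `chunk.text` and `chunk.metadata` are assumed names.
with open("chunks.jsonl", "w", encoding="utf-8") as f:
    for chunk in chunks:
        record = {"text": chunk.text, "metadata": chunk.metadata}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```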
## Common Workflows

### Documentation Site to RAG

```python
from ragcrawl import CrawlJob, CrawlerConfig

config = CrawlerConfig(
    seeds=["https://docs.example.com"],
    max_pages=500,
    include_patterns=["/docs/*"],
)

job = CrawlJob(config)
result = await job.run()  # run inside an async context (e.g. asyncio.run)

# Chunk for embeddings
from ragcrawl.chunking import HeadingChunker

chunker = HeadingChunker(max_tokens=500)
chunks = chunker.chunk_documents(result.documents)
```
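
From here, the chunks can be embedded with any provider. A sketch with the OpenAI SDK, where `chunk.text` is again an assumed attribute name:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[chunk.text for chunk in chunks[:100]],  # batch as needed
)
vectors = [item.embedding for item in response.data]
```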
### Keep Knowledge Base Fresh

```python
from ragcrawl import SyncJob, SyncConfig

config = SyncConfig(
    site_id="site_abc123",
    use_sitemap=True,
    use_conditional_requests=True,
)

job = SyncJob(config)
result = await job.run()  # run inside an async context (e.g. asyncio.run)

print(f"Updated: {result.stats.pages_changed}")
print(f"New: {result.stats.pages_new}")
print(f"Deleted: {result.stats.pages_deleted}")
```
## Best Practices

- **Start Small**: Begin with a small `max_pages` limit to test your configuration before crawling an entire site.
- **Use Include Patterns**: Focus your crawl on relevant content with `include_patterns` to avoid noise.
- **Enable Caching**: Use DuckDB storage to enable efficient incremental syncs.
- **Respect Rate Limits**: Always configure appropriate delays between requests to avoid overloading target servers.
## Next Steps
- Configuration Reference - All configuration options
- CLI Reference - Command-line usage
- API Reference - Python API documentation