Crawling Guide¶
This guide covers all aspects of crawling websites with ragcrawl.
How Crawling Works¶
- Seed URLs: The crawler starts from one or more seed URLs
- Fetch: Each page is fetched via HTTP or browser rendering
- Extract: Content is converted to Markdown, links are extracted
- Filter: Links are filtered based on domain, patterns, and depth
- Queue: New links are added to the priority queue (frontier)
- Store: Page content and metadata are stored
- Repeat: Process continues until limits are reached
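Conceptually, the loop looks like the sketch below. This is an illustrative outline only, not ragcrawl's actual internals: fetch, extract, and allowed are placeholder callables standing in for the pipeline stages above, and a plain FIFO deque stands in for the priority-queue frontier.
Python
from collections import deque
from typing import Callable, Iterable

def crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], str],                       # HTTP or browser rendering
    extract: Callable[[str], tuple[str, list[str]]],   # HTML -> (Markdown, links)
    allowed: Callable[[str], bool],                    # domain/pattern/depth filter
    max_depth: int,
    max_pages: int,
) -> dict[str, str]:
    frontier = deque((url, 0) for url in seeds)        # seeds enter at depth 0
    seen = set(seeds)
    pages: dict[str, str] = {}
    while frontier and len(pages) < max_pages:         # stop when limits are reached
        url, depth = frontier.popleft()
        markdown, links = extract(fetch(url))          # fetch, convert, extract links
        pages[url] = markdown                          # store content (metadata omitted)
        for link in links:
            if depth < max_depth and allowed(link) and link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return pages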
CLI Usage¶
Basic Crawl¶
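In its simplest form, a crawl takes a single seed URL and falls back to the configured defaults for everything else:
Bash
ragcrawl crawl https://docs.example.com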
Full Options¶
Bash
ragcrawl crawl https://docs.example.com \
--max-pages 500 \
--max-depth 10 \
--output ./output \
--output-mode multi \
--storage ./crawler.duckdb \
--include "/docs/.*" \
--exclude "/admin/.*" \
--robots \
--js \
--export-json ./docs.json \
--export-jsonl ./docs.jsonl \
--verbose
CLI Options Reference¶
| Option | Short | Description |
|---|---|---|
| --max-pages | -m | Maximum pages to crawl (default: from config) |
| --max-depth | -d | Maximum crawl depth (default: from config) |
| --output | -o | Output directory |
| --output-mode | | single or multi page output |
| --storage | -s | DuckDB storage path (default: ~/.ragcrawl/ragcrawl.duckdb) |
| --include | -i | Include URL patterns (regex, repeatable) |
| --exclude | -e | Exclude URL patterns (regex, repeatable) |
| --robots/--no-robots | | Respect robots.txt (default: enabled) |
| --js/--no-js | | Enable JavaScript rendering |
| --export-json | | Export documents to JSON file |
| --export-jsonl | | Export documents to JSONL file |
| --markdown-config | | Path to MarkdownConfig overrides (TOML/JSON) |
| --verbose | -v | Verbose output |
Multiple Seeds¶
You can specify multiple seed URLs:
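For example (a sketch assuming extra seeds are passed as additional positional arguments; the exact CLI form may differ):
Bash
ragcrawl crawl https://docs.example.com https://api.example.com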
Crawl Modes¶
HTTP Mode (Default)¶
Fast crawling using HTTP requests. Best for static sites:
Python
from ragcrawl.config.crawler_config import CrawlerConfig, FetchMode
config = CrawlerConfig(
seeds=["https://docs.example.com"],
fetch_mode=FetchMode.HTTP,
)
Browser Mode¶
Uses headless Chromium for JavaScript-heavy sites:
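A config-level sketch; the enum member name is assumed here to be FetchMode.BROWSER:
Python
from ragcrawl.config.crawler_config import CrawlerConfig, FetchMode

config = CrawlerConfig(
    seeds=["https://example.com"],
    fetch_mode=FetchMode.BROWSER,  # assumed member name; renders pages in headless Chromium
)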
Or from the CLI, via the --js flag:
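Bash
ragcrawl crawl https://example.com --js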
Hybrid Mode¶
Tries HTTP first, falls back to browser on failure:
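A config-level sketch; the enum member name is assumed here to be FetchMode.HYBRID:
Python
from ragcrawl.config.crawler_config import CrawlerConfig, FetchMode

config = CrawlerConfig(
    seeds=["https://example.com"],
    fetch_mode=FetchMode.HYBRID,  # assumed member name; HTTP first, browser fallback on failure
)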
URL Filtering¶
Domain Restrictions¶
By default, the crawler stays within the seed domains:
Python
config = CrawlerConfig(
seeds=["https://docs.example.com"],
allowed_domains=["docs.example.com", "api.example.com"],
allow_subdomains=True, # Also allows sub.docs.example.com
)
Path Patterns¶
Use regex patterns to include or exclude URLs:
Python
config = CrawlerConfig(
seeds=["https://example.com"],
include_patterns=[
r"/docs/.*", # Only crawl /docs/
r"/api/v\d+/.*", # API versioned paths
],
exclude_patterns=[
r"/admin/.*", # Skip admin pages
r".*\.pdf$", # Skip PDFs
],
)
Depth Limiting¶
Control how deep the crawler goes from seed URLs:
Python
config = CrawlerConfig(
seeds=["https://example.com"],
max_depth=3, # Maximum 3 clicks from seed
)
Rate Limiting¶
Per-Domain Rate Limits¶
Respect website resources:
Python
config = CrawlerConfig(
seeds=["https://example.com"],
requests_per_second=2.0, # Max 2 requests/second per domain
concurrent_requests=5, # Max 5 concurrent requests total
)
Delays¶
Add delays between requests:
Python
config = CrawlerConfig(
seeds=["https://example.com"],
delay_range=(1.0, 3.0), # Random delay 1-3 seconds
)
Robots.txt Compliance¶
Strict Mode (Default)¶
Respects all robots.txt directives:
Python
from ragcrawl.config.crawler_config import RobotsMode
config = CrawlerConfig(
seeds=["https://example.com"],
robots_mode=RobotsMode.STRICT,
)
Off Mode¶
Ignores robots.txt (use responsibly):
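A config-level sketch; the enum member name is assumed here to be RobotsMode.OFF:
Python
from ragcrawl.config.crawler_config import CrawlerConfig, RobotsMode

config = CrawlerConfig(
    seeds=["https://example.com"],
    robots_mode=RobotsMode.OFF,  # assumed member name; robots.txt directives are ignored
)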
Page Limits¶
Total Pages¶
Python
config = CrawlerConfig(
seeds=["https://example.com"],
max_pages=1000, # Stop after 1000 pages
)
Per-Domain Limits¶
Python
config = CrawlerConfig(
seeds=["https://example.com", "https://other.com"],
max_pages_per_domain=500, # Max 500 pages per domain
)
Error Handling¶
Retry Configuration¶
Python
config = CrawlerConfig(
seeds=["https://example.com"],
max_retries=3, # Retry failed requests
retry_delay=5.0, # Wait 5 seconds between retries
timeout=30.0, # 30 second timeout per request
)
Circuit Breaker¶
The crawler automatically stops requesting from failing domains:
Python
config = CrawlerConfig(
seeds=["https://example.com"],
circuit_breaker_threshold=10, # Stop after 10 consecutive failures
circuit_breaker_reset=300, # Reset after 5 minutes
)
Hooks and Callbacks¶
On Page Crawled¶
Python
from ragcrawl.hooks.callbacks import HookManager
def on_page(url: str, content: str, metadata: dict):
print(f"Crawled: {url} ({len(content)} chars)")
hooks = HookManager()
hooks.on_page_crawled(on_page)
config = CrawlerConfig(
seeds=["https://example.com"],
hooks=hooks,
)
On Error¶
Python
def on_error(url: str, error: Exception):
print(f"Error crawling {url}: {error}")
hooks.on_error(on_error)
Content Processing¶
Redaction¶
Remove sensitive content before storage:
Python
from ragcrawl.hooks.callbacks import PatternRedactor
redactor = PatternRedactor([
r"\b\d{3}-\d{2}-\d{4}\b", # SSN pattern
r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b", # Email
])
hooks = HookManager()
hooks.add_redactor(redactor)
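As with other hooks, pass the manager to the crawler by setting hooks=hooks on CrawlerConfig, as in the On Page Crawled example above.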
Complete Example¶
Python
import asyncio
from ragcrawl.config.crawler_config import (
CrawlerConfig,
FetchMode,
RobotsMode,
)
from ragcrawl.config.output_config import OutputConfig, OutputMode
from ragcrawl.config.storage_config import DuckDBConfig, StorageConfig
from ragcrawl.core.crawl_job import CrawlJob
async def crawl_documentation():
config = CrawlerConfig(
# Seeds
seeds=["https://docs.example.com"],
# Limits
max_pages=1000,
max_depth=10,
# Filtering
include_patterns=[r"/docs/.*", r"/tutorials/.*"],
exclude_patterns=[r"/admin/.*"],
# Fetching
fetch_mode=FetchMode.HTTP,
robots_mode=RobotsMode.STRICT,
requests_per_second=2.0,
concurrent_requests=5,
timeout=30.0,
# Storage
storage=StorageConfig(
backend=DuckDBConfig(path="./docs.duckdb")
),
# Output
output=OutputConfig(
mode=OutputMode.MULTI,
root_dir="./docs-output",
include_metadata=True,
rewrite_links=True,
),
)
job = CrawlJob(config)
result = await job.run()
if result.success:
print(f"Successfully crawled {result.stats.pages_crawled} pages")
print(f"Failed: {result.stats.pages_failed}")
print(f"Duration: {result.duration_seconds:.1f}s")
else:
print(f"Crawl failed: {result.error}")
asyncio.run(crawl_documentation())