
Configuration

ragcrawl exposes configuration for crawl behavior, storage, output, and Markdown extraction. This section gives an overview of the options and points to the detailed page for each area.

Configuration Overview

ragcrawl uses four main configuration classes:

| Config Class   | Purpose                           | Documentation       |
|----------------|-----------------------------------|---------------------|
| CrawlerConfig  | Controls crawl behavior           | Crawler Options     |
| StorageConfig  | Database backend settings         | Storage Backends    |
| OutputConfig   | Output format and files           | Output Settings     |
| MarkdownConfig | Markdown extraction and filtering | Markdown Extraction |

Quick Configuration Example

Python
from ragcrawl.config import CrawlerConfig, StorageConfig, OutputConfig
from ragcrawl.config.storage_config import DuckDBConfig

config = CrawlerConfig(
    # What to crawl
    seeds=["https://docs.example.com"],
    allowed_domains=["docs.example.com"],

    # Crawl limits
    max_pages=100,
    max_depth=5,

    # URL filtering
    include_patterns=["/docs/*", "/api/*"],
    exclude_patterns=["/blog/*", "*/print/*"],

    # Politeness
    delay_seconds=1.0,
    robots_mode="strict",

    # Storage
    storage=StorageConfig(
        backend=DuckDBConfig(path="./crawler.duckdb")
    ),

    # Output
    output=OutputConfig(
        mode="multi",
        root_dir="./output",
        include_metadata=True,
    ),
)
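To make the include and exclude patterns in the example more concrete, here is a minimal sketch of glob-style URL filtering. It is not ragcrawl's implementation: the `is_allowed` helper is hypothetical, and it assumes that patterns are shell-style globs matched against the URL path and that exclusions take precedence. Check the Crawler Options page for the actual matching rules.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def is_allowed(url, include_patterns, exclude_patterns):
    """Hypothetical helper: True if the URL path passes the include/exclude globs."""
    path = urlparse(url).path or "/"
    # Assumption: an exclude match always wins over an include match.
    if any(fnmatchcase(path, pattern) for pattern in exclude_patterns):
        return False
    # Assumption: an empty include list means "allow everything not excluded".
    if not include_patterns:
        return True
    return any(fnmatchcase(path, pattern) for pattern in include_patterns)


include = ["/docs/*", "/api/*"]
exclude = ["/blog/*", "*/print/*"]

for url in (
    "https://docs.example.com/docs/intro",
    "https://docs.example.com/blog/news",
    "https://docs.example.com/docs/print/intro",
):
    print(url, is_allowed(url, include, exclude))
```

In this sketch the first URL is accepted, the second is rejected by `/blog/*`, and the third is rejected by `*/print/*` even though it also matches `/docs/*`.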

CLI Configuration

You can also configure ragcrawl from the command line:

Bash
ragcrawl crawl https://docs.example.com \
    --max-pages 100 \
    --max-depth 5 \
    --include "/docs/*" \
    --exclude "/blog/*" \
    --delay 1.0 \
    --output ./output

Configuration Files

ragcrawl supports YAML configuration files:

YAML
# ragcrawl.yaml
seeds:
  - https://docs.example.com

crawl:
  max_pages: 100
  max_depth: 5
  delay_seconds: 1.0

filters:
  include_patterns:
    - "/docs/*"
  exclude_patterns:
    - "/blog/*"

storage:
  backend: duckdb
  path: ./crawler.duckdb

output:
  mode: multi
  root_dir: ./output

Load with:

Bash
ragcrawl crawl --config ragcrawl.yaml
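The YAML file groups settings into sections, while the Python example passes most of them as flat `CrawlerConfig` arguments. The sketch below shows one plausible way the two layouts correspond. The `yaml_to_kwargs` helper is hypothetical and is not part of ragcrawl; the input dict is what a YAML loader would return for the file above, written out by hand so the example needs no third-party packages.

```python
def yaml_to_kwargs(doc):
    """Hypothetical helper: flatten a ragcrawl.yaml-style dict into keyword arguments."""
    kwargs = {"seeds": list(doc.get("seeds", []))}
    # Assumption: keys under 'crawl' and 'filters' map to same-named flat fields.
    for section in ("crawl", "filters"):
        kwargs.update(doc.get(section, {}))
    # 'storage' and 'output' stay nested, mirroring StorageConfig and OutputConfig.
    for section in ("storage", "output"):
        if section in doc:
            kwargs[section] = dict(doc[section])
    return kwargs


# Parsed form of the ragcrawl.yaml shown above.
parsed = {
    "seeds": ["https://docs.example.com"],
    "crawl": {"max_pages": 100, "max_depth": 5, "delay_seconds": 1.0},
    "filters": {"include_patterns": ["/docs/*"], "exclude_patterns": ["/blog/*"]},
    "storage": {"backend": "duckdb", "path": "./crawler.duckdb"},
    "output": {"mode": "multi", "root_dir": "./output"},
}

kwargs = yaml_to_kwargs(parsed)
for name, value in kwargs.items():
    print(f"{name} = {value!r}")
```

The resulting names (`seeds`, `max_pages`, `max_depth`, `delay_seconds`, `include_patterns`, `exclude_patterns`, `storage`, `output`) line up with the arguments used in the Quick Configuration Example.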

Environment Variables

Some settings can be configured via environment variables:

| Variable            | Description              | Default          |
|---------------------|--------------------------|------------------|
| RAGCRAWL_DB_PATH    | DuckDB database path     | ./crawler.duckdb |
| RAGCRAWL_OUTPUT_DIR | Default output directory | ./output         |
| RAGCRAWL_LOG_LEVEL  | Logging level            | INFO             |
| RAGCRAWL_USER_AGENT | Custom user agent        | ragcrawl/0.1     |
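If your own scripts need the same values, the variables can be read with the defaults from the table above. This is a small sketch using only the standard library. The `resolve_settings` helper is hypothetical, and how ragcrawl itself resolves these variables, or how they rank against CLI flags and config files, is not described here.

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "RAGCRAWL_DB_PATH": "./crawler.duckdb",
    "RAGCRAWL_OUTPUT_DIR": "./output",
    "RAGCRAWL_LOG_LEVEL": "INFO",
    "RAGCRAWL_USER_AGENT": "ragcrawl/0.1",
}


def resolve_settings(environ=None):
    """Hypothetical helper: read each variable, falling back to its default."""
    env = os.environ if environ is None else environ
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


# Pass a plain dict instead of os.environ to keep the example deterministic.
settings = resolve_settings({"RAGCRAWL_LOG_LEVEL": "DEBUG"})
print(settings["RAGCRAWL_LOG_LEVEL"])  # overridden value
print(settings["RAGCRAWL_DB_PATH"])    # default value
```

Calling `resolve_settings()` with no argument reads from the real process environment.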

Configuration Sections

  • Crawler Options: Configure crawl behavior, limits, URL filtering, and politeness settings.

  • Storage Backends: Choose between DuckDB and DynamoDB for data persistence.

  • Output Settings: Configure output formats, file structure, and export options.

  • Markdown Extraction: Tune Crawl4AI filters and Markdown generator options for cleaner output.

Validation

Configuration is validated when the config object is constructed. Invalid values raise a Pydantic ValidationError:

Python
from ragcrawl.config import CrawlerConfig
from pydantic import ValidationError

try:
    config = CrawlerConfig(
        seeds=[],  # Error: empty seeds
        max_pages=-1,  # Error: negative value
    )
except ValidationError as e:
    print(e.errors())

Next Steps