Storage Configuration¶
Configure where crawl data is stored.
Overview¶
ragcrawl supports two storage backends:
- DuckDB (default): Local file-based storage, great for development and single-machine deployments
- DynamoDB: AWS cloud storage, suitable for distributed and serverless deployments
DuckDB Configuration¶
Basic Setup¶
Python
from ragcrawl.config.storage_config import DuckDBConfig, StorageConfig
storage = StorageConfig(
backend=DuckDBConfig(path="./crawler.duckdb")
)
Options¶
| Option | Type | Default | Description |
|---|---|---|---|
| path | str | "./crawler.duckdb" | Path to database file |
| read_only | bool | False | Open in read-only mode |
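For example, to open an existing database for analysis without risking writes, set the read_only option from the table above (a minimal sketch):
Python
from ragcrawl.config.storage_config import DuckDBConfig, StorageConfig

# Open an existing crawl database in read-only mode
storage = StorageConfig(
    backend=DuckDBConfig(
        path="./crawler.duckdb",
        read_only=True,  # reject any writes to the file
    )
)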
CLI Usage¶
Bash
# Default storage
ragcrawl crawl https://example.com
# Custom storage path
ragcrawl crawl https://example.com --storage ./data/my-crawl.duckdb
Multiple Databases¶
Python
# Separate databases for different projects
project_a = StorageConfig(backend=DuckDBConfig(path="./project-a.duckdb"))
project_b = StorageConfig(backend=DuckDBConfig(path="./project-b.duckdb"))
DynamoDB Configuration¶
Prerequisites¶
- Install DynamoDB support (see the example commands after this list)
- Configure AWS credentials (see the example commands after this list)
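A minimal sketch of both steps, assuming DynamoDB support ships as an optional package extra named dynamodb (the extra name is an assumption) and that credentials come from the AWS CLI or environment variables:
Bash
# Install optional DynamoDB support (the "dynamodb" extra name is an assumption)
pip install "ragcrawl[dynamodb]"

# Configure AWS credentials via the AWS CLI...
aws configure

# ...or via environment variables
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1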
Basic Setup¶
Python
from ragcrawl.config.storage_config import DynamoDBConfig, StorageConfig
storage = StorageConfig(
backend=DynamoDBConfig(
table_prefix="myapp",
region="us-east-1",
)
)
Options¶
| Option | Type | Default | Description |
|---|---|---|---|
| table_prefix | str | "ragcrawl" | Prefix for table names |
| region | str | "us-east-1" | AWS region |
| endpoint_url | str | None | Custom endpoint (for local testing) |
| create_tables | bool | True | Auto-create tables if missing |
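For example, when tables are provisioned ahead of time (e.g. via infrastructure-as-code), auto-creation can be disabled; a minimal sketch using only the options in the table above:
Python
from ragcrawl.config.storage_config import DynamoDBConfig, StorageConfig

# Point at pre-provisioned tables and skip auto-creation
storage = StorageConfig(
    backend=DynamoDBConfig(
        table_prefix="myapp",
        region="eu-west-1",
        create_tables=False,  # tables are expected to already exist
    )
)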
Table Structure¶
The following tables are created:
- {prefix}-sites: Site configurations
- {prefix}-runs: Crawl run records
- {prefix}-pages: Page metadata
- {prefix}-versions: Page versions with content
- {prefix}-frontier: Crawl queue items
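With table_prefix="myapp", for example, the tables are myapp-sites, myapp-runs, myapp-pages, myapp-versions, and myapp-frontier. A minimal sketch (assuming boto3 is installed and AWS credentials are configured) that checks the expected names exist:
Python
import boto3

# Expected ragcrawl table names for a given prefix
prefix = "myapp"
expected = {f"{prefix}-{suffix}" for suffix in ("sites", "runs", "pages", "versions", "frontier")}

# Compare against the tables that actually exist in the region
dynamodb = boto3.client("dynamodb", region_name="us-east-1")
existing = set(dynamodb.list_tables()["TableNames"])

missing = expected - existing
print("Missing tables:", sorted(missing) if missing else "none")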
Local Development with DynamoDB Local¶
Python
storage = StorageConfig(
backend=DynamoDBConfig(
table_prefix="dev",
region="us-east-1",
endpoint_url="http://localhost:8000", # DynamoDB Local
)
)
Run DynamoDB Local:
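For example, with the official Docker image (one common way to run it):
Bash
# Start DynamoDB Local on port 8000 (matches the endpoint_url above)
docker run -p 8000:8000 amazon/dynamodb-local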
Fallback Configuration¶
Configure a fallback backend for when DynamoDB is unavailable:
Python
storage = StorageConfig(
backend=DynamoDBConfig(
table_prefix="prod",
region="us-east-1",
),
fallback=DuckDBConfig(path="./fallback.duckdb"),
fail_if_unavailable=False, # Use fallback instead of failing
)
Accessing Storage Directly¶
Python
from ragcrawl.storage.backend import create_storage_backend
from ragcrawl.config.storage_config import DuckDBConfig, StorageConfig
# Create backend
config = StorageConfig(backend=DuckDBConfig(path="./crawler.duckdb"))
backend = create_storage_backend(config)
backend.initialize()
# Query data
sites = backend.list_sites()
for site in sites:
    print(f"Site: {site.name}")
    pages = backend.list_pages(site.site_id)
    print(f" Pages: {len(pages)}")

# Clean up
backend.close()
Data Schema¶
Sites Table¶
| Column | Type | Description |
|---|---|---|
| site_id | string | Unique identifier |
| name | string | Site name |
| seeds | json | List of seed URLs |
| allowed_domains | json | Allowed domains |
| config | json | Crawler configuration |
| created_at | datetime | Creation timestamp |
| total_pages | integer | Total pages crawled |
| total_runs | integer | Total crawl runs |
Pages Table¶
| Column | Type | Description |
|---|---|---|
| page_id | string | Unique identifier |
| site_id | string | Parent site |
| url | string | Page URL |
| content_hash | string | Current content hash |
| first_seen | datetime | First crawl time |
| last_crawled | datetime | Last crawl time |
| is_tombstone | boolean | Deleted page marker |
Versions Table¶
| Column | Type | Description |
|---|---|---|
| version_id | string | Unique identifier |
| page_id | string | Parent page |
| run_id | string | Crawl run |
| markdown | text | Markdown content |
| content_hash | string | Content hash |
| title | string | Page title |
| crawled_at | datetime | Crawl timestamp |
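For ad-hoc analysis of a DuckDB database, the schema above can also be queried directly with SQL; a minimal sketch, assuming the backend stores these columns in tables named pages and versions (the table names are an assumption; prefer the storage API shown earlier where possible):
Python
import duckdb

# Open the crawl database directly, read-only, for an ad-hoc query
con = duckdb.connect("./crawler.duckdb", read_only=True)

# Most recently crawled versions with their page URLs
# (table and column names assumed from the schema tables above)
rows = con.execute(
    """
    SELECT p.url, v.title, v.crawled_at
    FROM pages p
    JOIN versions v ON v.page_id = p.page_id
    ORDER BY v.crawled_at DESC
    LIMIT 10
    """
).fetchall()

for url, title, crawled_at in rows:
    print(url, title, crawled_at)

con.close()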
Best Practices¶
- Development: Use DuckDB for fast local iteration
- Production: Use DynamoDB for scalability and durability
- Testing: Use DynamoDB Local or in-memory DuckDB
- Backups: DuckDB files can be copied; DynamoDB has built-in backups
- Migration: Move data between backends by exporting to JSON and importing into the other backend