DuckDB Backend

The DuckDB backend provides local file-based storage using the DuckDB embedded database.

Overview

DuckDB is the default storage backend, ideal for:

  • Local development and testing
  • Single-machine deployments
  • Small to medium crawls (up to millions of pages)
  • Fast SQL queries on crawled data

Configuration

Python
from ragcrawl.config.storage_config import StorageConfig, DuckDBConfig

config = StorageConfig(
    backend=DuckDBConfig(
        path="./crawler.duckdb",  # Database file path
        read_only=False,          # Set to True to open the database read-only
    )
)

Configuration Options

Option      Type   Default              Description
path        str    "./crawler.duckdb"   Database file path
read_only   bool   False                Open in read-only mode
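
With `read_only=True`, the database file generally has to exist already, since a read-only connection cannot create it. A minimal sketch of a pre-flight check, assuming a hypothetical helper named `require_existing_db` that is not part of ragcrawl:

```python
from pathlib import Path


def require_existing_db(path: str) -> Path:
    """Return the resolved path, or raise if no database file exists there.

    Hypothetical helper (not part of ragcrawl): intended to run before
    building a DuckDBConfig with read_only=True.
    """
    db_path = Path(path).expanduser().resolve()
    if not db_path.is_file():
        raise FileNotFoundError(f"No DuckDB database file at {db_path}")
    return db_path
```

The returned value can then be passed on as `path=str(db_path)` when constructing `DuckDBConfig`.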

Usage

Basic Usage

Python
from ragcrawl.storage import create_storage_backend

backend = create_storage_backend(config)
backend.initialize()

# Use the backend
sites = backend.list_sites()

# Close when done
backend.close()

With Context Manager

Python
with create_storage_backend(config) as backend:
    backend.initialize()
    sites = backend.list_sites()

Direct SQL Queries

You can access the underlying DuckDB connection for custom queries:

Python
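from ragcrawl.config.storage_config import DuckDBConfig
from ragcrawl.storage.duckdb import DuckDBBackend

# Build the DuckDB-specific config directly (same options as under Configuration)
duckdb_config = DuckDBConfig(path="./crawler.duckdb")
backend = DuckDBBackend(duckdb_config)
backend.initialize()

site_id = "site_abc123"  # example site ID, matching the queries further down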

# Run custom SQL
result = backend.conn.execute("""
    SELECT url, status_code, last_crawled
    FROM pages
    WHERE site_id = ?
    ORDER BY last_crawled DESC
    LIMIT 10
""", [site_id]).fetchall()

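`fetchall()` returns plain tuples, so the column names are lost. A small sketch of a hypothetical `fetch_dicts` helper (not part of ragcrawl) that pairs each row with the column names from the DB-API `description` attribute, which DuckDB's Python connection exposes after `execute()`:

```python
from typing import Any


def fetch_dicts(cursor: Any) -> list[dict[str, Any]]:
    """Convert the pending result of a DB-API cursor into a list of dicts.

    Hypothetical helper: `cursor` is whatever `conn.execute(...)` returned.
    """
    # The first element of each description entry is the column name.
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Usage sketch (assumes `backend` and `site_id` from the example above):
# rows = fetch_dicts(backend.conn.execute(
#     "SELECT url, status_code FROM pages WHERE site_id = ?", [site_id]))
# for row in rows:
#     print(row["url"], row["status_code"])
```
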
Database Schema

Tables

Table            Description
sites            Site configurations
crawl_runs       Crawl execution records
pages            Page state and metadata
page_versions    Content version history
frontier_items   URL queue items

Example Queries

Find recently changed pages:

SQL
SELECT url, last_changed
FROM pages
WHERE site_id = 'site_abc123'
  AND last_changed > CURRENT_TIMESTAMP - INTERVAL 7 DAY
ORDER BY last_changed DESC;

Get crawl statistics:

SQL
SELECT
    COUNT(*) as total_pages,
    COUNT(CASE WHEN status_code = 200 THEN 1 END) as successful,
    COUNT(CASE WHEN is_tombstone THEN 1 END) as deleted
FROM pages
WHERE site_id = 'site_abc123';

Find pages with errors:

SQL
SELECT url, status_code, last_error
FROM pages
WHERE site_id = 'site_abc123'
  AND error_count > 0
ORDER BY error_count DESC;
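
The example queries hard-code the site ID. When running them from Python it is usually safer to bind the ID as a parameter. A minimal sketch for the crawl-statistics query, assuming a hypothetical `crawl_stats` helper and a DB-API style connection such as `backend.conn`:

```python
from typing import Any

# Same statistics query as above, with the site ID as a bound parameter.
STATS_SQL = """
    SELECT
        COUNT(*) AS total_pages,
        COUNT(CASE WHEN status_code = 200 THEN 1 END) AS successful,
        COUNT(CASE WHEN is_tombstone THEN 1 END) AS deleted
    FROM pages
    WHERE site_id = ?
"""


def crawl_stats(conn: Any, site_id: str) -> dict[str, int]:
    """Run the crawl-statistics query for one site and label the counts.

    Hypothetical helper (not part of ragcrawl).
    """
    total, successful, deleted = conn.execute(STATS_SQL, [site_id]).fetchone()
    return {"total_pages": total, "successful": successful, "deleted": deleted}


# Usage sketch: stats = crawl_stats(backend.conn, "site_abc123")
```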

Performance Tips

Indexing

The schema includes indexes for common queries. For custom queries, consider adding indexes:

SQL
CREATE INDEX idx_pages_status ON pages(site_id, status_code);
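
A plain `CREATE INDEX` fails if the index already exists, which is awkward when the statement runs on every startup. One possible approach, sketched with a hypothetical `ensure_indexes` helper, is to use `IF NOT EXISTS` so the call can be repeated safely:

```python
from typing import Any

# Index statements to apply; IF NOT EXISTS makes each one safe to re-run.
CUSTOM_INDEXES = [
    "CREATE INDEX IF NOT EXISTS idx_pages_status ON pages(site_id, status_code)",
]


def ensure_indexes(conn: Any) -> None:
    """Create any missing custom indexes (hypothetical helper, not part of ragcrawl)."""
    for statement in CUSTOM_INDEXES:
        conn.execute(statement)


# Usage sketch: ensure_indexes(backend.conn) after backend.initialize()
```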

Vacuuming

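DuckDB accepts VACUUM mainly for PostgreSQL compatibility, and it does not shrink the database file. To reclaim space after large deletes, run CHECKPOINT instead:

Python
backend.conn.execute("CHECKPOINT")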

Memory Settings

For large crawls, increase the memory limit:

Python
backend.conn.execute("SET memory_limit='4GB'")
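
If the limit comes from an environment variable or a config file, the value ends up interpolated into the SQL string, so it is worth validating first. A minimal sketch, assuming a hypothetical `set_memory_limit` helper and sizes written like `512MB` or `4GB`:

```python
import re
from typing import Any

# Accept values such as "512MB", "4GB" or "1.5GB"; reject anything else.
_LIMIT_PATTERN = re.compile(r"\d+(\.\d+)?(KB|MB|GB|TB)")


def set_memory_limit(conn: Any, limit: str) -> None:
    """Validate a size string and apply it as DuckDB's memory_limit.

    Hypothetical helper (not part of ragcrawl).
    """
    if not _LIMIT_PATTERN.fullmatch(limit):
        raise ValueError(f"Unsupported memory limit: {limit!r}")
    conn.execute(f"SET memory_limit='{limit}'")


# Usage sketch:
# set_memory_limit(backend.conn, "4GB")
# backend.conn.execute("SELECT current_setting('memory_limit')").fetchone()
```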

API Reference

DuckDBBackend

Python
DuckDBBackend(config: DuckDBConfig)

Bases: StorageBackend

DuckDB storage backend implementation.

Provides local file-based storage with SQL capabilities.

Initialize DuckDB backend.

PARAMETER   DESCRIPTION
config      DuckDB configuration. TYPE: DuckDBConfig

Source code in src/ragcrawl/storage/duckdb/backend.py
Python
def __init__(self, config: DuckDBConfig) -> None:
    """
    Initialize DuckDB backend.

    Args:
        config: DuckDB configuration.
    """
    self.config = config
    self.db_path = Path(config.path)
    self._conn: duckdb.DuckDBPyConnection | None = None

initialize

Python
initialize() -> None

Initialize the database schema.

Source code in src/ragcrawl/storage/duckdb/backend.py
Python
def initialize(self) -> None:
    """Initialize the database schema."""
    for schema_sql in get_all_schemas():
        self.conn.execute(schema_sql)

close

Python
close() -> None

Close the database connection.

Source code in src/ragcrawl/storage/duckdb/backend.py
Python
def close(self) -> None:
    """Close the database connection."""
    if self._conn is not None:
        self._conn.close()
        self._conn = None

health_check

Python
health_check() -> bool

Check if the database is accessible.

Source code in src/ragcrawl/storage/duckdb/backend.py
Python
def health_check(self) -> bool:
    """Check if the database is accessible."""
    try:
        self.conn.execute("SELECT 1").fetchone()
        return True
    except Exception:
        return False
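
Because `health_check()` returns a boolean instead of raising, it fits well into a startup probe. A minimal sketch of a retry loop, assuming a hypothetical `wait_until_healthy` helper that relies only on the `health_check()` method shown above:

```python
import time
from typing import Any


def wait_until_healthy(backend: Any, attempts: int = 5, delay: float = 0.5) -> bool:
    """Poll backend.health_check() until it succeeds or attempts run out.

    Hypothetical helper (not part of ragcrawl): returns True on the first
    successful check, False if every attempt fails.
    """
    for attempt in range(attempts):
        if backend.health_check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # pause before the next probe
    return False


# Usage sketch:
# if not wait_until_healthy(backend):
#     raise RuntimeError("DuckDB backend is not reachable")
```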