DuckDB Backend
The DuckDB backend provides local file-based storage using the DuckDB embedded database.
Overview
DuckDB is the default storage backend, ideal for:
- Local development and testing
- Single-machine deployments
- Small to medium crawls (up to millions of pages)
- Fast SQL queries on crawled data
Configuration
Python
from ragcrawl.config.storage_config import StorageConfig, DuckDBConfig

config = StorageConfig(
    backend=DuckDBConfig(
        path="./crawler.duckdb",  # Database file path
        read_only=False,          # Open in read-only mode
    )
)
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| path | str | "./crawler.duckdb" | Database file path |
| read_only | bool | False | Open in read-only mode |
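DuckDB cannot create a missing database file when it is opened read-only, so it can help to check the path before building the configuration. The sketch below shows one way to do that. `duckdb_config_kwargs` is a hypothetical helper, not part of ragcrawl, and it assumes `DuckDBConfig` takes only the two options listed above and passes `read_only` through to DuckDB.

```python
from pathlib import Path


def duckdb_config_kwargs(path="./crawler.duckdb", read_only=False):
    """Return keyword arguments for DuckDBConfig after a basic path check.

    Hypothetical helper: it only knows the two documented options.
    """
    if read_only and not Path(path).exists():
        # A read-only open cannot create the file, so fail early and clearly.
        raise FileNotFoundError(f"{path} does not exist; cannot open read-only")
    return {"path": str(path), "read_only": read_only}


# Usage (assumes ragcrawl is installed):
#   from ragcrawl.config.storage_config import DuckDBConfig
#   backend_config = DuckDBConfig(**duckdb_config_kwargs("./crawler.duckdb"))
```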
Usage
Basic Usage
Python
from ragcrawl.storage import create_storage_backend

# `config` is the StorageConfig built in the Configuration section
backend = create_storage_backend(config)
backend.initialize()

# Use the backend
sites = backend.list_sites()

# Close when done
backend.close()
With Context Manager
Python
from ragcrawl.storage import create_storage_backend

# Leaving the block closes the backend
with create_storage_backend(config) as backend:
    backend.initialize()
    sites = backend.list_sites()
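Direct SQL Queries
You can access the underlying DuckDB connection for custom queries:
Python
from ragcrawl.config.storage_config import DuckDBConfig
from ragcrawl.storage.duckdb import DuckDBBackend

# Build the backend directly from a DuckDBConfig
duckdb_config = DuckDBConfig(path="./crawler.duckdb")
backend = DuckDBBackend(duckdb_config)
backend.initialize()

site_id = "site_abc123"  # ID of the site to inspect

# Run custom SQL; `?` placeholders are bound from the parameter list
result = backend.conn.execute("""
    SELECT url, status_code, last_crawled
    FROM pages
    WHERE site_id = ?
    ORDER BY last_crawled DESC
    LIMIT 10
""", [site_id]).fetchall()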
Database Schema
Tables
| Table | Description |
|---|---|
| sites | Site configurations |
| crawl_runs | Crawl execution records |
| pages | Page state and metadata |
| page_versions | Content version history |
| frontier_items | URL queue items |
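When querying the database directly, a quick check that the expected tables exist can catch a wrong file path or an uninitialized database. The sketch below compares a list of table names against the five tables listed above. `missing_tables` is a hypothetical helper, not part of ragcrawl.

```python
EXPECTED_TABLES = {"sites", "crawl_runs", "pages", "page_versions", "frontier_items"}


def missing_tables(existing_names):
    """Return the expected tables absent from existing_names, sorted by name."""
    return sorted(EXPECTED_TABLES - set(existing_names))


# Usage (assumes an initialized backend; SHOW TABLES is a DuckDB statement
# that returns one row per table, with the name in the first column):
#   rows = backend.conn.execute("SHOW TABLES").fetchall()
#   print(missing_tables(row[0] for row in rows))
```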
Example Queries
Find recently changed pages:
SQL
SELECT url, last_changed
FROM pages
WHERE site_id = 'site_abc123'
  AND last_changed > CURRENT_TIMESTAMP - INTERVAL 7 DAY
ORDER BY last_changed DESC;
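From Python, the same query can take the site ID and the time window as bound parameters instead of literals. The sketch below computes the cutoff in Python and binds it with `?`. `recently_changed_query` is a hypothetical helper, and the column names are taken from the query above. It uses naive local time, which is an assumption; adjust it if `last_changed` is stored in UTC.

```python
from datetime import datetime, timedelta


def recently_changed_query(site_id, days=7, now=None):
    """Return (sql, params) selecting pages changed within the last `days` days."""
    now = now or datetime.now()  # naive local time; an assumption about storage
    cutoff = now - timedelta(days=days)
    sql = (
        "SELECT url, last_changed "
        "FROM pages "
        "WHERE site_id = ? AND last_changed > ? "
        "ORDER BY last_changed DESC"
    )
    return sql, [site_id, cutoff]


# Usage (assumes an initialized backend as in "Direct SQL Queries"):
#   sql, params = recently_changed_query("site_abc123", days=7)
#   rows = backend.conn.execute(sql, params).fetchall()
```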
Get crawl statistics:
SQL
SELECT
    COUNT(*) AS total_pages,
    COUNT(CASE WHEN status_code = 200 THEN 1 END) AS successful,
    COUNT(CASE WHEN is_tombstone THEN 1 END) AS deleted
FROM pages
WHERE site_id = 'site_abc123';
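The statistics query returns a single row of three counts. A minimal sketch of turning that row into a labeled report with a success rate follows. `summarize_crawl` and `stats_sql` are hypothetical names, and the tuple passed in at the end is made-up sample data.

```python
def summarize_crawl(row):
    """Turn a (total_pages, successful, deleted) row into a small report dict."""
    total, successful, deleted = row
    # Avoid dividing by zero for a site with no pages yet.
    success_rate = successful / total if total else 0.0
    return {
        "total_pages": total,
        "successful": successful,
        "deleted": deleted,
        "success_rate": round(success_rate, 4),
    }


# Usage: pass the single row returned by the statistics query, e.g.
#   row = backend.conn.execute(stats_sql).fetchone()
print(summarize_crawl((200, 150, 10)))
```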
Find pages with errors:
SQL
SELECT url, status_code, last_error
FROM pages
WHERE site_id = 'site_abc123'
  AND error_count > 0
ORDER BY error_count DESC;
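When queries like these are run from Python, `fetchall()` returns plain tuples, which are easy to misindex once several columns are selected. The sketch below pairs each row with its column names. `rows_to_dicts` is a hypothetical helper, and it assumes the object returned by `backend.conn.execute(...)` exposes the DB-API `description` attribute, as DuckDB's Python connection does.

```python
def rows_to_dicts(cursor):
    """Convert the pending result of a DB-API cursor into a list of dicts."""
    # description holds one sequence per column; item 0 is the column name.
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Usage (assumes an initialized backend as in "Direct SQL Queries"):
#   cursor = backend.conn.execute(
#       "SELECT url, status_code, last_error FROM pages WHERE error_count > 0"
#   )
#   for page in rows_to_dicts(cursor):
#       print(page["url"], page["last_error"])
```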
Performance Tips
Indexing
The schema includes indexes for common queries. For custom queries that filter or sort on other columns, consider adding your own indexes.
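As an illustration, the sketch below builds a `CREATE INDEX IF NOT EXISTS` statement for a chosen table and columns. `create_index_sql` is a hypothetical helper, and the index naming scheme is an assumption. Which columns the shipped schema already indexes is not stated here, so check before adding one.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def create_index_sql(table, columns):
    """Build a CREATE INDEX statement for the given table and columns."""
    if not columns:
        raise ValueError("at least one column is required")
    names = [table, *columns]
    for name in names:
        # Identifiers cannot be bound as parameters, so validate them instead.
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    index_name = "idx_" + "_".join(names)
    column_list = ", ".join(columns)
    return f"CREATE INDEX IF NOT EXISTS {index_name} ON {table} ({column_list})"


# Usage (assumes an initialized backend as in "Direct SQL Queries"):
#   backend.conn.execute(create_index_sql("pages", ["site_id", "last_changed"]))
print(create_index_sql("pages", ["site_id", "last_changed"]))
```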
Vacuuming
Periodically vacuum the database.
Memory Settings
For large crawls, increase DuckDB's memory limit.
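DuckDB exposes a `memory_limit` setting that can be changed with a `SET` statement. The sketch below builds that statement from a size in gigabytes. `memory_limit_sql` is a hypothetical helper, the 8 GB figure is only illustrative, and whether ragcrawl offers its own option for this is not covered here.

```python
def memory_limit_sql(gigabytes):
    """Build a DuckDB SET statement for the memory_limit setting."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(gigabytes, bool) or not isinstance(gigabytes, int) or gigabytes <= 0:
        raise ValueError("gigabytes must be a positive integer")
    return f"SET memory_limit = '{gigabytes}GB'"


# Usage (assumes an initialized backend as in "Direct SQL Queries"):
#   backend.conn.execute(memory_limit_sql(8))
print(memory_limit_sql(8))
```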
API Reference
DuckDBBackend
Bases: StorageBackend
DuckDB storage backend implementation.
Provides local file-based storage with SQL capabilities.
Initialize DuckDB backend.
| PARAMETER | DESCRIPTION |
|---|---|
| config | DuckDB configuration. TYPE: |