Storage API
ragcrawl supports pluggable storage backends for persisting crawl data.
Overview
| Backend | Description | Use Case |
|---|---|---|
| DuckDB | Local file-based SQL database | Default, local development |
| DynamoDB | AWS managed NoSQL database | Cloud deployments, scalability |
Storage Backend Interface
All backends implement the StorageBackend protocol:
from ragcrawl.storage import StorageBackend
class StorageBackend(Protocol):
    # Lifecycle
    def initialize(self) -> None: ...
    def close(self) -> None: ...
    def health_check(self) -> bool: ...
    # Sites
    def save_site(self, site: Site) -> None: ...
    def get_site(self, site_id: str) -> Site | None: ...
    def list_sites(self) -> list[Site]: ...
    # Crawl Runs
    def save_run(self, run: CrawlRun) -> None: ...
    def get_run(self, run_id: str) -> CrawlRun | None: ...
    def list_runs(self, site_id: str) -> list[CrawlRun]: ...
    # Pages
    def save_page(self, page: Page) -> None: ...
    def get_page(self, page_id: str) -> Page | None: ...
    def get_page_by_url(self, site_id: str, url: str) -> Page | None: ...
    def list_pages(self, site_id: str) -> list[Page]: ...
    # Versions
    def save_version(self, version: PageVersion) -> None: ...
    def get_version(self, version_id: str) -> PageVersion | None: ...
    def list_versions(self, page_id: str) -> list[PageVersion]: ...
    # Frontier
    def save_frontier_item(self, item: FrontierItem) -> None: ...
    def get_frontier_items(self, run_id: str) -> list[FrontierItem]: ...
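To make the shape of the interface concrete, here is a minimal in-memory sketch covering only the lifecycle and site methods. The `Site` dataclass and `InMemorySiteStore` class are hypothetical stand-ins defined for illustration; they are not part of ragcrawl, and the real `Site` model may have different fields.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical stand-in for ragcrawl's Site model (fields are assumptions)."""
    site_id: str
    name: str


class InMemorySiteStore:
    """Illustrative only: keeps sites in a dict instead of a database."""

    def __init__(self) -> None:
        self._sites: dict[str, Site] = {}
        self._open = False

    def initialize(self) -> None:
        self._open = True

    def close(self) -> None:
        self._open = False

    def health_check(self) -> bool:
        return self._open

    def save_site(self, site: Site) -> None:
        # Saving an existing ID overwrites the previous record.
        self._sites[site.site_id] = site

    def get_site(self, site_id: str) -> Site | None:
        # Returns None for unknown IDs, matching the `Site | None` return type.
        return self._sites.get(site_id)

    def list_sites(self) -> list[Site]:
        return list(self._sites.values())


store = InMemorySiteStore()
store.initialize()
store.save_site(Site(site_id="docs", name="Docs site"))
found = store.get_site("docs")
missing = store.get_site("unknown")
```

A store like this can be handy in unit tests, where a real DuckDB file or DynamoDB table would be unnecessary overhead.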
Quick Start
Using DuckDB (Default)
from ragcrawl.config.storage_config import StorageConfig, DuckDBConfig
from ragcrawl.storage import create_storage_backend
config = StorageConfig(
    backend=DuckDBConfig(path="./crawler.duckdb")
)
backend = create_storage_backend(config)
backend.initialize()
# Use the backend
sites = backend.list_sites()
Using DynamoDB
from ragcrawl.config.storage_config import StorageConfig, DynamoDBConfig
from ragcrawl.storage import create_storage_backend
config = StorageConfig(
    backend=DynamoDBConfig(
        table_prefix="ragcrawl_",
        region="us-west-2",
    )
)
backend = create_storage_backend(config)
backend.initialize()
Factory Function
Use create_storage_backend() to create backends:
from ragcrawl.storage import create_storage_backend
# Automatically selects backend based on config
backend = create_storage_backend(storage_config)
Context Manager
Backends support the context manager protocol:
with create_storage_backend(config) as backend:
    backend.initialize()
    sites = backend.list_sites()
# Automatically closed
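The sketch below shows what context-manager support typically looks like, using a hypothetical `DemoBackend` class rather than a real ragcrawl backend. It assumes `__exit__` simply calls `close()`, which is the usual pattern; the actual ragcrawl implementation may differ.

```python
class DemoBackend:
    """Hypothetical stand-in; not a ragcrawl class."""

    def __init__(self) -> None:
        self.closed = False

    def initialize(self) -> None:
        pass

    def close(self) -> None:
        self.closed = True

    def __enter__(self) -> "DemoBackend":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Runs on normal exit and when the block raises.
        self.close()
        return False  # do not swallow the exception


backend = DemoBackend()
try:
    with backend:
        backend.initialize()
        raise ValueError("simulated failure mid-crawl")
except ValueError:
    pass
```

The point of the pattern is that `close()` still runs when the body raises, so the backend is released even if a crawl fails partway through.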
Module Reference
Storage backend protocol and factory.
StorageBackend
Bases: ABC
Abstract base class for storage backends.
All backends must implement this interface to ensure feature parity.
delete_site
abstractmethod
list_runs
abstractmethod
list_runs(
    site_id: str, limit: int = 100, offset: int = 0
) -> list[CrawlRun]
list_pages
abstractmethod
list_pages(
    site_id: str,
    limit: int = 1000,
    offset: int = 0,
    include_tombstones: bool = False,
) -> list[Page]
get_pages_needing_recrawl
abstractmethod
get_pages_needing_recrawl(
    site_id: str,
    max_age_hours: float | None = None,
    limit: int = 1000,
) -> list[Page]
Get pages that need to be re-crawled.
count_pages
abstractmethod
save_version
abstractmethod
save_version(version: PageVersion) -> None
get_version
abstractmethod
get_version(version_id: str) -> PageVersion | None
get_current_version
abstractmethod
get_current_version(page_id: str) -> PageVersion | None
list_versions
abstractmethod
list_versions(
    page_id: str, limit: int = 100
) -> list[PageVersion]
save_frontier_item
abstractmethod
save_frontier_item(item: FrontierItem) -> None
get_frontier_items
abstractmethod
get_frontier_items(
    run_id: str,
    status: str | None = None,
    limit: int = 1000,
) -> list[FrontierItem]
update_frontier_status
abstractmethod
clear_frontier
abstractmethod
save_versions_bulk
abstractmethod
save_versions_bulk(versions: list[PageVersion]) -> int
initialize
abstractmethod
close
abstractmethod
create_storage_backend
create_storage_backend(
    config: StorageConfig,
) -> StorageBackend
Create a storage backend from configuration.
Falls back to DuckDB if the configured backend is unavailable and fail_if_unavailable is False.
| PARAMETER | DESCRIPTION |
|---|---|
| config | Storage configuration. TYPE: StorageConfig |

| RETURNS | DESCRIPTION |
|---|---|
| StorageBackend | A StorageBackend instance. |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If the configured backend is unavailable and fail_if_unavailable is True. |
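The fallback rule described above can be sketched as a small standalone function. This is not ragcrawl's implementation: `create_with_fallback` and the two factory callables are hypothetical, and the sketch assumes that "unavailable" surfaces as an `ImportError` (for example, a missing optional dependency), which the source does not confirm.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def create_with_fallback(
    make_primary: Callable[[], T],
    make_fallback: Callable[[], T],
    fail_if_unavailable: bool,
) -> T:
    """Try the configured backend; fall back or raise depending on the flag."""
    try:
        return make_primary()
    except ImportError as exc:
        # Assumption: an unavailable backend shows up as ImportError.
        if fail_if_unavailable:
            raise RuntimeError("configured backend is unavailable") from exc
        return make_fallback()


def broken_primary() -> str:
    raise ImportError("simulated missing dependency")


def duckdb_fallback() -> str:
    return "duckdb"


# Lenient mode: the fallback factory is used.
lenient = create_with_fallback(broken_primary, duckdb_fallback, fail_if_unavailable=False)

# Strict mode: the failure is reported as RuntimeError.
try:
    create_with_fallback(broken_primary, duckdb_fallback, fail_if_unavailable=True)
    strict_error = None
except RuntimeError as err:
    strict_error = err
```

Setting the flag to True suits production deployments where silently writing to a local DuckDB file instead of the intended store would be worse than failing fast.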