DynamoDB Backend¶
The DynamoDB backend provides scalable cloud storage using AWS DynamoDB.
Overview¶
DynamoDB is ideal for:
- Cloud-native deployments
- Distributed crawling systems
- High availability requirements
- Serverless architectures
Installation¶
Install with DynamoDB support:
Configuration¶
Python
from ragcrawl.config.storage_config import StorageConfig, DynamoDBConfig
config = StorageConfig(
    backend=DynamoDBConfig(
        table_prefix="ragcrawl_",
        region="us-west-2",
        endpoint_url=None,  # Use for local DynamoDB
    )
)
Configuration Options¶
| Option | Type | Default | Description |
|---|---|---|---|
| `table_prefix` | `str` | `"ragcrawl_"` | Prefix for table names |
| `region` | `str` | `"us-east-1"` | AWS region |
| `endpoint_url` | `str` | `None` | Custom endpoint (for local) |
AWS Credentials¶
The backend uses standard AWS credential resolution:
- Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
- AWS credentials file (`~/.aws/credentials`)
- IAM role (for EC2/Lambda)
Usage¶
Basic Usage¶
Python
from ragcrawl.storage import create_storage_backend
backend = create_storage_backend(config)
backend.initialize() # Creates tables if needed
sites = backend.list_sites()
backend.close()
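Because `close()` has to run even when a call in between fails, the initialize/close pair can be wrapped in a context manager. This is a minimal sketch, not part of ragcrawl: `open_backend` is a hypothetical helper, and it relies only on the `initialize()` and `close()` methods shown above.

```python
from contextlib import contextmanager


@contextmanager
def open_backend(backend):
    """Initialize a storage backend and close it when the block exits.

    Hypothetical helper: it only assumes the backend exposes
    initialize() and close(), as in the usage example above.
    """
    backend.initialize()  # creates tables if needed
    try:
        yield backend
    finally:
        # Runs whether the body finished normally or raised.
        backend.close()


# Usage, assuming `config` and `create_storage_backend` from the examples above:
# with open_backend(create_storage_backend(config)) as backend:
#     sites = backend.list_sites()
```

If `initialize()` itself raises, `close()` is not called in this sketch, since nothing has been opened successfully at that point.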
Local Development¶
Use DynamoDB Local for development:
Python
config = StorageConfig(
    backend=DynamoDBConfig(
        table_prefix="dev_",
        region="us-east-1",
        endpoint_url="http://localhost:8000",
    )
)
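One way to avoid hard-coding the local endpoint is to derive the settings from the environment, so the same code runs against DynamoDB Local and AWS. The sketch below is an assumption-laden illustration: `dynamodb_settings` and the `RAGCRAWL_DYNAMODB_ENDPOINT` variable are hypothetical names, not something ragcrawl is known to read. `AWS_DEFAULT_REGION` is the standard AWS region variable.

```python
import os
from typing import Mapping, Optional


def dynamodb_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for DynamoDBConfig from environment variables.

    RAGCRAWL_DYNAMODB_ENDPOINT is a hypothetical variable used only in this
    sketch; when it is set, the "dev_" prefix and the local endpoint are used.
    """
    env = os.environ if env is None else env
    endpoint = env.get("RAGCRAWL_DYNAMODB_ENDPOINT") or None
    return {
        "table_prefix": "dev_" if endpoint else "ragcrawl_",
        "region": env.get("AWS_DEFAULT_REGION", "us-east-1"),
        "endpoint_url": endpoint,
    }


# Usage, assuming the imports from the Configuration section:
# config = StorageConfig(backend=DynamoDBConfig(**dynamodb_settings()))
```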
Table Structure¶
Tables Created¶
| Table | Partition Key | Sort Key | Description |
|---|---|---|---|
| `{prefix}sites` | `site_id` | - | Site records |
| `{prefix}runs` | `site_id` | `run_id` | Crawl runs |
| `{prefix}pages` | `site_id` | `page_id` | Pages |
| `{prefix}versions` | `page_id` | `version_id` | Content versions |
| `{prefix}frontier` | `run_id` | `item_id` | Queue items |
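To make the naming rule concrete, the table layout above can be written down as data, which is handy when scripting against the tables directly. This is an illustrative sketch: `KEY_SCHEMA` and `table_names` are hypothetical names, and the key names are simply copied from the table above.

```python
# Key schema copied from the table above: (partition key, sort key or None).
KEY_SCHEMA = {
    "sites": ("site_id", None),
    "runs": ("site_id", "run_id"),
    "pages": ("site_id", "page_id"),
    "versions": ("page_id", "version_id"),
    "frontier": ("run_id", "item_id"),
}


def table_names(prefix: str = "ragcrawl_") -> dict:
    """Map each logical table to its physical DynamoDB table name."""
    return {name: f"{prefix}{name}" for name in KEY_SCHEMA}


# With the local-development prefix, the frontier table is "dev_frontier".
print(table_names("dev_")["frontier"])
```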
Global Secondary Indexes¶
- `runs`: GSI on `run_id` for direct lookup
- `pages`: GSI on `url` for URL lookups
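A lookup through the `pages` URL index, done outside ragcrawl with a boto3 DynamoDB client, might be built as in the sketch below. The index name `url-index` is a placeholder (the real name is not given here and can be read from `DescribeTable`), and the `url` attribute is assumed to be stored as a string. `URL` is a DynamoDB reserved word, so the expression aliases it.

```python
def page_by_url_query(
    url: str,
    prefix: str = "ragcrawl_",
    index_name: str = "url-index",  # placeholder, check the real index name
) -> dict:
    """Build keyword arguments for a boto3 DynamoDB client query() call.

    Assumes the GSI key attribute is named "url" and has string type ("S").
    """
    return {
        "TableName": f"{prefix}pages",
        "IndexName": index_name,
        # URL is a reserved word in DynamoDB expressions, so alias it as #u.
        "KeyConditionExpression": "#u = :u",
        "ExpressionAttributeNames": {"#u": "url"},
        "ExpressionAttributeValues": {":u": {"S": url}},
    }


# Usage, assuming boto3 is installed and credentials are configured:
# import boto3
# client = boto3.client("dynamodb", region_name="us-east-1")
# response = client.query(**page_by_url_query("https://example.com/"))
```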
IAM Permissions¶
Required IAM permissions:
JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:CreateTable",
        "dynamodb:DescribeTable",
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:DeleteItem",
        "dynamodb:Query",
        "dynamodb:Scan"
      ],
      "Resource": "arn:aws:dynamodb:*:*:table/ragcrawl_*"
    }
  ]
}
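The `Resource` pattern above matches the default `ragcrawl_` prefix, so a deployment with a different `table_prefix` needs a matching pattern. The sketch below generates the same policy for an arbitrary prefix and optionally narrows it to one region and account. `build_policy` is a hypothetical helper, and the account ID in the example is a dummy value.

```python
import json

# Same actions as the policy above.
ACTIONS = [
    "dynamodb:CreateTable",
    "dynamodb:DescribeTable",
    "dynamodb:GetItem",
    "dynamodb:PutItem",
    "dynamodb:UpdateItem",
    "dynamodb:DeleteItem",
    "dynamodb:Query",
    "dynamodb:Scan",
]


def build_policy(table_prefix: str = "ragcrawl_", region: str = "*", account_id: str = "*") -> dict:
    """Return an IAM policy document scoped to tables that start with table_prefix."""
    resource = f"arn:aws:dynamodb:{region}:{account_id}:table/{table_prefix}*"
    return {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": ACTIONS, "Resource": resource}],
    }


# Example: policy for the "dev_" prefix in one region (dummy account ID).
print(json.dumps(build_policy("dev_", "us-east-1", "123456789012"), indent=2))
```

With the default arguments the function reproduces the policy shown above.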
Cost Optimization¶
On-Demand vs Provisioned¶
By default, tables use on-demand capacity. For predictable workloads, consider provisioned capacity:
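Bash
# Set via the AWS Console or the AWS CLI after table creation.
# A table with global secondary indexes may also need throughput for each
# index, passed with --global-secondary-index-updates.
aws dynamodb update-table \
    --table-name ragcrawl_pages \
    --billing-mode PROVISIONED \
    --provisioned-throughput ReadCapacityUnits=100,WriteCapacityUnits=50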
TTL for Frontier Items¶
Enable TTL on the frontier table so that DynamoDB deletes expired items automatically.
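As a sketch of what enabling TTL involves, the helper below calls `update_time_to_live` on a boto3 DynamoDB client. The attribute name `expires_at` is an assumption made for illustration: TTL only removes items that carry a numeric attribute holding a Unix timestamp in seconds, and which attribute (if any) ragcrawl writes on frontier items is not stated here.

```python
import time
from typing import Optional


def expiry_epoch(days: int, now: Optional[float] = None) -> int:
    """Return the Unix time in whole seconds `days` from now (the TTL value format)."""
    base = time.time() if now is None else now
    return int(base) + days * 86400


def enable_frontier_ttl(client, table_prefix: str = "ragcrawl_", attribute: str = "expires_at"):
    """Enable TTL on the frontier table via a boto3 DynamoDB client.

    `attribute` is a hypothetical name; use the attribute that actually
    stores the expiry timestamp on frontier items.
    """
    return client.update_time_to_live(
        TableName=f"{table_prefix}frontier",
        TimeToLiveSpecification={"Enabled": True, "AttributeName": attribute},
    )


# Usage, assuming boto3 is installed and credentials are configured:
# import boto3
# enable_frontier_ttl(boto3.client("dynamodb", region_name="us-east-1"))
```

Note that this call needs the `dynamodb:UpdateTimeToLive` permission, which is not in the policy listed under IAM Permissions.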
API Reference¶
DynamoDBBackend¶
Bases: StorageBackend
DynamoDB storage backend implementation using PynamoDB.
Initialize DynamoDB backend.
| PARAMETER | DESCRIPTION |
|---|---|
| `config` | DynamoDB configuration. |
Source code in src/ragcrawl/storage/dynamodb/backend.py
initialize¶
Create tables if they don't exist.
close¶
health_check¶
Check if DynamoDB is accessible.
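A typical use of `health_check` is to wait for DynamoDB (for example a freshly started DynamoDB Local container) before starting a crawl. The sketch below assumes `health_check()` returns a truthy value when the service is reachable and a falsy one otherwise. The return type is not documented here, so treat that as an assumption. `wait_until_healthy` is a hypothetical helper.

```python
import time


def wait_until_healthy(backend, attempts: int = 5, delay: float = 1.0, sleep=time.sleep) -> bool:
    """Poll backend.health_check() until it succeeds or attempts run out.

    Assumes health_check() returns truthy when DynamoDB is accessible.
    `sleep` is injectable so the loop can be tested without waiting.
    """
    for attempt in range(attempts):
        if backend.health_check():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # pause between attempts, not after the last one
    return False


# Usage, assuming `backend` was created as in Basic Usage:
# if not wait_until_healthy(backend):
#     raise RuntimeError("DynamoDB is not reachable")
```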