
Site

The Site model represents a crawled website and its configuration.

Overview

Site stores:

  • Site identification (ID, name)
  • Seed URLs for crawling
  • Domain restrictions
  • Crawl statistics
  • Timestamps

Usage

Creating a Site

Sites are typically created automatically during crawling, but you can also construct one directly:

Python
from datetime import datetime, timezone
from ragcrawl.models import Site

site = Site(
    site_id="site_abc123",
    name="Example Documentation",
    seeds=["https://docs.example.com"],
    allowed_domains=["docs.example.com"],
    allowed_subdomains=True,
    created_at=datetime.now(timezone.utc),
    updated_at=datetime.now(timezone.utc),
)
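
When a Site is built by hand, allowed_domains usually mirrors the hosts of the seed URLs. The following is a minimal sketch of deriving that list with the standard library only. The helper name domains_from_seeds is hypothetical and is not part of ragcrawl.

```python
from urllib.parse import urlsplit


def domains_from_seeds(seeds: list[str]) -> list[str]:
    """Return the unique hostnames of the seed URLs, in first-seen order.

    Hypothetical helper for illustration; not a ragcrawl function.
    """
    domains: list[str] = []
    for seed in seeds:
        # .hostname is lowercased, and is None when the URL has no host part
        host = urlsplit(seed).hostname
        if host and host not in domains:
            domains.append(host)
    return domains


seeds = [
    "https://docs.example.com",
    "https://docs.example.com/guide",
    "https://api.example.com/v1",
]
print(domains_from_seeds(seeds))
```

The result can be passed as allowed_domains when constructing the Site, which keeps the seed list and the domain restriction in step.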

Accessing Site Data

Python
# Get basic info
print(f"Site: {site.name}")
print(f"ID: {site.site_id}")
print(f"Seeds: {site.seeds}")

# Get statistics
print(f"Total pages: {site.total_pages}")
print(f"Total runs: {site.total_runs}")

# Check timestamps
print(f"Created: {site.created_at}")
print(f"Last crawl: {site.last_crawl_at}")
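
A site that has not been crawled yet may have no last-crawl timestamp. The sketch below assumes last_crawl_at is None in that case, which this page does not confirm. It uses a stand-in object rather than a real Site so that it runs on its own. The helper describe_last_crawl is hypothetical.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def describe_last_crawl(site, now=None):
    """Return a short, human-readable description of the last crawl time."""
    if site.last_crawl_at is None:  # assumption: None means "never crawled"
        return "never crawled"
    now = now or datetime.now(timezone.utc)
    age = now - site.last_crawl_at
    return f"last crawled {age.days} day(s) ago"


# Stand-in objects carrying the same attribute name as Site, for illustration only.
crawled = SimpleNamespace(last_crawl_at=datetime(2024, 1, 1, tzinfo=timezone.utc))
fresh = SimpleNamespace(last_crawl_at=None)

print(describe_last_crawl(crawled, now=datetime(2024, 1, 4, tzinfo=timezone.utc)))
print(describe_last_crawl(fresh))
```

Both timestamps are timezone-aware (UTC), matching the creation example above. Mixing naive and aware datetimes in the subtraction would raise a TypeError.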

Listing Sites

Python
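from ragcrawl.storage import create_storage_backend

# storage_config is your existing storage configuration object;
# it is not defined in this snippet.
backend = create_storage_backend(storage_config)
backend.initialize()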

sites = backend.list_sites()
for site in sites:
    print(f"{site.site_id}: {site.name} ({site.total_pages} pages)")
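
The list returned by list_sites() can be filtered and sorted with ordinary Python. The following sketch uses stand-in objects that carry the documented field names (site_id, is_active, total_pages) so that it runs without a storage backend. The helper largest_active_sites is hypothetical.

```python
from types import SimpleNamespace


def largest_active_sites(sites, limit=5):
    """Return up to `limit` active sites, ordered by total_pages descending."""
    active = [s for s in sites if s.is_active]
    return sorted(active, key=lambda s: s.total_pages, reverse=True)[:limit]


# Stand-ins for Site objects, for illustration only.
demo_sites = [
    SimpleNamespace(site_id="site_a", total_pages=120, is_active=True),
    SimpleNamespace(site_id="site_b", total_pages=900, is_active=False),
    SimpleNamespace(site_id="site_c", total_pages=450, is_active=True),
]

for s in largest_active_sites(demo_sites):
    print(f"{s.site_id}: {s.total_pages} pages")
```

With a real backend, the same function could be called on the result of backend.list_sites().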

Fields

| Field              | Type      | Description            |
|--------------------|-----------|------------------------|
| site_id            | str       | Unique site identifier |
| name               | str       | Human-readable name    |
| seeds              | list[str] | Starting URLs          |
| allowed_domains    | list[str] | Domains to crawl       |
| allowed_subdomains | bool      | Allow subdomains       |
| config             | dict      | Configuration snapshot |
| created_at         | datetime  | Creation time          |
| updated_at         | datetime  | Last update time       |
| last_crawl_at      | datetime  | Last crawl time        |
| last_sync_at       | datetime  | Last sync time         |
| total_pages        | int       | Total pages crawled    |
| total_runs         | int       | Total crawl runs       |
| is_active          | bool      | Site is active         |
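
This page does not spell out how allowed_domains and allowed_subdomains interact. The sketch below shows one plausible interpretation: a URL is in scope if its host matches an allowed domain exactly, or is a subdomain of one when allowed_subdomains is true. The function is_url_allowed is hypothetical and should not be read as ragcrawl's actual implementation.

```python
from urllib.parse import urlsplit


def is_url_allowed(url, allowed_domains, allowed_subdomains):
    """Check a URL's host against a domain allow-list (assumed semantics)."""
    host = urlsplit(url).hostname  # lowercased; None if the URL has no host
    if host is None:
        return False
    for domain in allowed_domains:
        domain = domain.lower()
        if host == domain:
            return True
        # Require a leading dot so "evildocs.example.com" does not match
        # the allowed domain "docs.example.com".
        if allowed_subdomains and host.endswith("." + domain):
            return True
    return False


allowed = ["docs.example.com"]
print(is_url_allowed("https://docs.example.com/intro", allowed, False))
print(is_url_allowed("https://v2.docs.example.com/intro", allowed, True))
print(is_url_allowed("https://v2.docs.example.com/intro", allowed, False))
```

Matching on the parsed hostname rather than on the raw URL string avoids false positives from paths or query strings that happen to contain an allowed domain.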

API Reference

Site

Bases: BaseModel

Represents a website/crawl target with its configuration.

Stores the configuration snapshot and metadata for a crawl target.