
Models

Data models used throughout ragcrawl.

Overview

ragcrawl uses Pydantic models for type-safe data handling:

Model         Description
Document      Crawled page content for output
Site          Crawl configuration and metadata
Page          Page state and freshness tracking
Chunk         Content chunks for embeddings
PageVersion   Versioned content snapshots
CrawlRun      Crawl execution records
FrontierItem  URL queue items

Model Relationships

Mermaid
erDiagram
    Site ||--o{ CrawlRun : has
    Site ||--o{ Page : contains
    Page ||--o{ PageVersion : versions
    CrawlRun ||--o{ PageVersion : creates
    CrawlRun ||--o{ FrontierItem : manages
    PageVersion ||--|| Document : exports_to
    Document ||--o{ Chunk : chunks_into

Quick Reference

Document

The output model for crawled content:

Python
from datetime import datetime

from ragcrawl.models import Document

doc = Document(
    doc_id="abc123",
    page_id="abc123",
    source_url="https://example.com/page",
    normalized_url="https://example.com/page",
    markdown="# Page Title\n\nContent here...",
    title="Page Title",
    status_code=200,
    content_type="text/html",
    depth=1,
    run_id="run_xyz",
    site_id="site_abc",
    first_seen=datetime.now(),
    last_seen=datetime.now(),
    last_crawled=datetime.now(),
)
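Since Document is a Pydantic model, each instance serializes straight to JSON for downstream RAG ingestion. The sketch below uses a plain dict as a stand-in for the model's fields (on the real model, Pydantic v2's `model_dump(mode="json")` would produce a similar mapping) and writes JSON Lines, a common output format for document pipelines:

```python
import json
from datetime import datetime, timezone

# Plain-dict stand-in for a Document's fields, so this sketch runs
# without ragcrawl installed.
doc = {
    "doc_id": "abc123",
    "source_url": "https://example.com/page",
    "markdown": "# Page Title\n\nContent here...",
    "title": "Page Title",
    "last_crawled": datetime.now(timezone.utc).isoformat(),
}

def to_jsonl(docs):
    """Serialize documents to JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(d, ensure_ascii=False) for d in docs)

line = to_jsonl([doc])
```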

Site

Configuration snapshot for a crawled site:

Python
from datetime import datetime

from ragcrawl.models import Site

site = Site(
    site_id="site_abc123",
    name="Example Docs",
    seeds=["https://docs.example.com"],
    allowed_domains=["docs.example.com"],
    created_at=datetime.now(),
    updated_at=datetime.now(),
)

Chunk

Content chunk for embeddings:

Python
from ragcrawl.models import Chunk

chunk = Chunk(
    chunk_id="chunk_001",
    doc_id="abc123",
    content="This is the chunk content...",
    chunk_index=0,
    total_chunks=5,
    section_path=["Introduction", "Overview"],
    heading="Overview",
    token_estimate=150,
)
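The section_path metadata lets a retrieval pipeline reconstruct where a chunk sits in the document. One minimal, illustrative use (the prefix format here is an assumption, not something ragcrawl does) is to prepend the breadcrumb to the chunk text before embedding, so retrieval keeps document context:

```python
def contextualize(content: str, section_path: list[str]) -> str:
    """Prefix a chunk with its section breadcrumb so the embedding
    retains document context. The bracketed format is illustrative."""
    breadcrumb = " > ".join(section_path)
    return f"[{breadcrumb}]\n{content}" if breadcrumb else content

text = contextualize(
    "This is the chunk content...",
    ["Introduction", "Overview"],
)
```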

Module Reference

Data models for ragcrawl.

Document

Bases: BaseModel

A crawled document with rich metadata for LLM/RAG consumption.

This is the primary output model containing all extracted content and metadata from a crawled page.

Site

Bases: BaseModel

Represents a website/crawl target with its configuration.

Stores the configuration snapshot and metadata for a crawl target.

Page

Bases: BaseModel

Represents the current state of a URL in the crawl database.

This model tracks freshness information and points to the current version of the page content. It's used for incremental sync to determine what needs re-crawling.

needs_recrawl

Python
needs_recrawl(
    max_age_hours: float | None = None, force: bool = False
) -> bool

Determine if this page needs to be re-crawled.

Parameters:

    max_age_hours (float | None, default None):
        Maximum age in hours before recrawl. None means always recrawl.

    force (bool, default False):
        If True, always return True.

Returns:

    bool: True if the page should be re-crawled.

Source code in src/ragcrawl/models/page.py
Python
def needs_recrawl(
    self,
    max_age_hours: float | None = None,
    force: bool = False,
) -> bool:
    """
    Determine if this page needs to be re-crawled.

    Args:
        max_age_hours: Maximum age in hours before recrawl. None means always recrawl.
        force: If True, always return True.

    Returns:
        True if the page should be re-crawled.
    """
    if force:
        return True

    if self.is_tombstone:
        return False

    if self.last_crawled is None:
        return True

    if max_age_hours is None:
        return True

    age = datetime.now() - self.last_crawled
    return age.total_seconds() / 3600 > max_age_hours

PageVersion

Bases: BaseModel

Represents a specific version of a page's content.

Each time content changes, a new PageVersion is created. This enables version history and change tracking for KB updates.

Chunk

Bases: BaseModel

Represents a chunk of content for RAG/embedding pipelines.

Chunks are segments of a document optimized for vector embedding and retrieval, with metadata for context reconstruction.

is_first property

Python
is_first: bool

Check if this is the first chunk.

is_last property

Python
is_last: bool

Check if this is the last chunk.
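These properties are presumably simple comparisons against chunk_index and total_chunks. A sketch of the likely logic as free functions (an assumption based on the field names above, not the source):

```python
def is_first(chunk_index: int) -> bool:
    # First chunk: zero-based index 0.
    return chunk_index == 0

def is_last(chunk_index: int, total_chunks: int) -> bool:
    # Last chunk: index equals total minus one.
    return chunk_index == total_chunks - 1
```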

CrawlRun

Bases: BaseModel

Represents a single crawl or sync execution.

Tracks the status, configuration snapshot, and statistics for a crawl run.

duration_seconds property

Python
duration_seconds: float | None

Get run duration in seconds.
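The property presumably subtracts started_at from completed_at and returns None while either timestamp is missing, which would explain the `float | None` type. A free-function sketch of that logic (an illustration, not the source):

```python
from datetime import datetime, timedelta

def duration_seconds(started_at, completed_at):
    """Return elapsed seconds, or None while a timestamp is missing."""
    if started_at is None or completed_at is None:
        return None
    return (completed_at - started_at).total_seconds()

start = datetime(2024, 1, 1, 12, 0, 0)
end = start + timedelta(minutes=2, seconds=30)
```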

mark_started

Python
mark_started() -> None

Mark the run as started.

Source code in src/ragcrawl/models/crawl_run.py
Python
def mark_started(self) -> None:
    """Mark the run as started."""
    self.status = RunStatus.RUNNING
    self.started_at = datetime.now()

mark_completed

Python
mark_completed(partial: bool = False) -> None

Mark the run as completed.

Source code in src/ragcrawl/models/crawl_run.py
Python
def mark_completed(self, partial: bool = False) -> None:
    """Mark the run as completed."""
    self.status = RunStatus.PARTIAL if partial else RunStatus.COMPLETED
    self.completed_at = datetime.now()

mark_failed

Python
mark_failed(error: str) -> None

Mark the run as failed.

Source code in src/ragcrawl/models/crawl_run.py
Python
def mark_failed(self, error: str) -> None:
    """Mark the run as failed."""
    self.status = RunStatus.FAILED
    self.error_message = error
    self.completed_at = datetime.now()

mark_cancelled

Python
mark_cancelled() -> None

Mark the run as cancelled.

Source code in src/ragcrawl/models/crawl_run.py
Python
def mark_cancelled(self) -> None:
    """Mark the run as cancelled."""
    self.status = RunStatus.CANCELLED
    self.completed_at = datetime.now()

FrontierItem

Bases: BaseModel

Represents a URL in the crawl frontier queue.

Used for pause/resume functionality and tracking crawl progress.

mark_in_progress

Python
mark_in_progress() -> None

Mark item as being crawled.

Source code in src/ragcrawl/models/frontier_item.py
Python
def mark_in_progress(self) -> None:
    """Mark item as being crawled."""
    self.status = FrontierStatus.IN_PROGRESS
    self.started_at = datetime.now()

mark_completed

Python
mark_completed() -> None

Mark item as successfully crawled.

Source code in src/ragcrawl/models/frontier_item.py
Python
def mark_completed(self) -> None:
    """Mark item as successfully crawled."""
    self.status = FrontierStatus.COMPLETED
    self.completed_at = datetime.now()

mark_failed

Python
mark_failed(error: str) -> None

Mark item as failed.

Source code in src/ragcrawl/models/frontier_item.py
Python
def mark_failed(self, error: str) -> None:
    """Mark item as failed."""
    self.status = FrontierStatus.FAILED
    self.last_error = error
    self.retry_count += 1
    self.completed_at = datetime.now()

mark_skipped

Python
mark_skipped(reason: str) -> None

Mark item as skipped.

Source code in src/ragcrawl/models/frontier_item.py
Python
def mark_skipped(self, reason: str) -> None:
    """Mark item as skipped."""
    self.status = FrontierStatus.SKIPPED
    self.last_error = reason
    self.completed_at = datetime.now()
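Note that mark_failed increments retry_count, which a scheduler can use to bound retries when requeueing frontier items. ragcrawl's actual retry policy isn't shown on this page; a hypothetical gate (max_retries and the string statuses are assumptions for illustration) might look like:

```python
def should_retry(status: str, retry_count: int, max_retries: int = 3) -> bool:
    """Requeue a failed frontier item until its retry budget is spent.
    max_retries is a hypothetical knob, not part of the model above."""
    return status == "failed" and retry_count < max_retries
```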