
Models

Data models used throughout ragcrawl.

Overview

ragcrawl uses Pydantic models for type-safe data handling:

Model         Description
Document      Crawled page content for output
Site          Crawl configuration and metadata
Page          Page state and freshness tracking
Chunk         Content chunks for embeddings
PageVersion   Versioned content snapshots
CrawlRun      Crawl execution records
FrontierItem  URL queue items

Model Relationships

Mermaid
erDiagram
    Site ||--o{ CrawlRun : has
    Site ||--o{ Page : contains
    Page ||--o{ PageVersion : versions
    CrawlRun ||--o{ PageVersion : creates
    CrawlRun ||--o{ FrontierItem : manages
    PageVersion ||--|| Document : exports_to
    Document ||--o{ Chunk : chunks_into

Quick Reference

Document

The output model for crawled content:

Python
from datetime import datetime

from ragcrawl.models import Document

doc = Document(
    doc_id="abc123",
    page_id="abc123",
    source_url="https://example.com/page",
    normalized_url="https://example.com/page",
    markdown="# Page Title\n\nContent here...",
    title="Page Title",
    status_code=200,
    content_type="text/html",
    depth=1,
    run_id="run_xyz",
    site_id="site_abc",
    first_seen=datetime.now(),
    last_seen=datetime.now(),
    last_crawled=datetime.now(),
)
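Since Document is a Pydantic model, each instance serializes straight to JSON for downstream RAG ingestion. The sketch below uses a plain dict as a stand-in for the model's fields (on the real model, Pydantic v2's `model_dump(mode="json")` would produce a similar mapping) and writes JSON Lines, a common output format for document pipelines:

```python
import json
from datetime import datetime, timezone

# Plain-dict stand-in for a Document's fields, so this sketch runs
# without ragcrawl installed.
doc = {
    "doc_id": "abc123",
    "source_url": "https://example.com/page",
    "markdown": "# Page Title\n\nContent here...",
    "title": "Page Title",
    "last_crawled": datetime.now(timezone.utc).isoformat(),
}

def to_jsonl(docs):
    """Serialize documents to JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(d, ensure_ascii=False) for d in docs)

line = to_jsonl([doc])
```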

Site

Configuration snapshot for a crawled site:

Python
from datetime import datetime

from ragcrawl.models import Site

site = Site(
    site_id="site_abc123",
    name="Example Docs",
    seeds=["https://docs.example.com"],
    allowed_domains=["docs.example.com"],
    created_at=datetime.now(),
    updated_at=datetime.now(),
)

Chunk

Content chunk for embeddings:

Python
from ragcrawl.models import Chunk

chunk = Chunk(
    chunk_id="chunk_001",
    doc_id="abc123",
    content="This is the chunk content...",
    chunk_index=0,
    total_chunks=5,
    section_path=["Introduction", "Overview"],
    heading="Overview",
    token_estimate=150,
)
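The section_path metadata lets a retrieval pipeline reconstruct where a chunk sits in the document. One minimal, illustrative use (the prefix format here is an assumption, not something ragcrawl does) is to prepend the breadcrumb to the chunk text before embedding, so retrieval keeps document context:

```python
def contextualize(content: str, section_path: list[str]) -> str:
    """Prefix a chunk with its section breadcrumb so the embedding
    retains document context. The bracketed format is illustrative."""
    breadcrumb = " > ".join(section_path)
    return f"[{breadcrumb}]\n{content}" if breadcrumb else content

text = contextualize(
    "This is the chunk content...",
    ["Introduction", "Overview"],
)
```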

Module Reference

Data models for ragcrawl.

Document

Bases: BaseModel

A crawled document with rich metadata for LLM/RAG consumption.

This is the primary output model containing all extracted content and metadata from a crawled page.

Site

Bases: BaseModel

Represents a website/crawl target with its configuration.

Stores the configuration snapshot and metadata for a crawl target.

Page

Bases: BaseModel

Represents the current state of a URL in the crawl database.

This model tracks freshness information and points to the current version of the page content. It's used for incremental sync to determine what needs re-crawling.

needs_recrawl

Python
needs_recrawl(
    max_age_hours: float | None = None, force: bool = False
) -> bool

Determine if this page needs to be re-crawled.

Parameters:

    max_age_hours (float | None, default None):
        Maximum age in hours before recrawl. None means always recrawl.

    force (bool, default False):
        If True, always return True.

Returns:

    bool: True if the page should be re-crawled.

Source code in src/ragcrawl/models/page.py
Python
def needs_recrawl(
    self,
    max_age_hours: float | None = None,
    force: bool = False,
) -> bool:
    """
    Determine if this page needs to be re-crawled.

    Args:
        max_age_hours: Maximum age in hours before recrawl. None means always recrawl.
        force: If True, always return True.

    Returns:
        True if the page should be re-crawled.
    """
    if force:
        return True

    if self.is_tombstone:
        return False

    if self.last_crawled is None:
        return True

    if max_age_hours is None:
        return True

    age = datetime.now() - self.last_crawled
    return age.total_seconds() / 3600 > max_age_hours

PageVersion

Bases: BaseModel

Represents a specific version of a page's content.

Each time content changes, a new PageVersion is created. This enables version history and change tracking for KB updates.

Chunk

Bases: BaseModel

Represents a chunk of content for RAG/embedding pipelines.

Chunks are segments of a document optimized for vector embedding and retrieval, with metadata for context reconstruction.

is_first property

Python
is_first: bool

Check if this is the first chunk.

is_last property

Python
is_last: bool

Check if this is the last chunk.
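These properties are presumably simple comparisons against chunk_index and total_chunks. A sketch of the likely logic as free functions (an assumption based on the field names above, not the source):

```python
def is_first(chunk_index: int) -> bool:
    # First chunk: zero-based index 0.
    return chunk_index == 0

def is_last(chunk_index: int, total_chunks: int) -> bool:
    # Last chunk: index equals total minus one.
    return chunk_index == total_chunks - 1
```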

CrawlRun

Bases: BaseModel

Represents a single crawl or sync execution.

Tracks the status, configuration snapshot, and statistics for a crawl run.

duration_seconds property

Python
duration_seconds: float | None

Get run duration in seconds.
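The property presumably subtracts started_at from completed_at and returns None while either timestamp is missing, which would explain the `float | None` type. A free-function sketch of that logic (an illustration, not the source):

```python
from datetime import datetime, timedelta

def duration_seconds(started_at, completed_at):
    """Return elapsed seconds, or None while a timestamp is missing."""
    if started_at is None or completed_at is None:
        return None
    return (completed_at - started_at).total_seconds()

start = datetime(2024, 1, 1, 12, 0, 0)
end = start + timedelta(minutes=2, seconds=30)
```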

mark_started

Python
mark_started() -> None

Mark the run as started.

Source code in src/ragcrawl/models/crawl_run.py
Python
def mark_started(self) -> None:
    """Mark the run as started."""
    self.status = RunStatus.RUNNING
    self.started_at = datetime.now()

mark_completed

Python
mark_completed(partial: bool = False) -> None

Mark the run as completed.

Source code in src/ragcrawl/models/crawl_run.py
Python
def mark_completed(self, partial: bool = False) -> None:
    """Mark the run as completed."""
    self.status = RunStatus.PARTIAL if partial else RunStatus.COMPLETED
    self.completed_at = datetime.now()

mark_failed

Python
mark_failed(error: str) -> None

Mark the run as failed.

Source code in src/ragcrawl/models/crawl_run.py
Python
def mark_failed(self, error: str) -> None:
    """Mark the run as failed."""
    self.status = RunStatus.FAILED
    self.error_message = error
    self.completed_at = datetime.now()

mark_cancelled

Python
mark_cancelled() -> None

Mark the run as cancelled.

Source code in src/ragcrawl/models/crawl_run.py
Python
def mark_cancelled(self) -> None:
    """Mark the run as cancelled."""
    self.status = RunStatus.CANCELLED
    self.completed_at = datetime.now()

FrontierItem

Bases: BaseModel

Represents a URL in the crawl frontier queue.

Used for pause/resume functionality and tracking crawl progress.

mark_in_progress

Python
mark_in_progress() -> None

Mark item as being crawled.

Source code in src/ragcrawl/models/frontier_item.py
Python
def mark_in_progress(self) -> None:
    """Mark item as being crawled."""
    self.status = FrontierStatus.IN_PROGRESS
    self.started_at = datetime.now()

mark_completed

Python
mark_completed() -> None

Mark item as successfully crawled.

Source code in src/ragcrawl/models/frontier_item.py
Python
def mark_completed(self) -> None:
    """Mark item as successfully crawled."""
    self.status = FrontierStatus.COMPLETED
    self.completed_at = datetime.now()

mark_failed

Python
mark_failed(error: str) -> None

Mark item as failed.

Source code in src/ragcrawl/models/frontier_item.py
Python
def mark_failed(self, error: str) -> None:
    """Mark item as failed."""
    self.status = FrontierStatus.FAILED
    self.last_error = error
    self.retry_count += 1
    self.completed_at = datetime.now()

mark_skipped

Python
mark_skipped(reason: str) -> None

Mark item as skipped.

Source code in src/ragcrawl/models/frontier_item.py
Python
def mark_skipped(self, reason: str) -> None:
    """Mark item as skipped."""
    self.status = FrontierStatus.SKIPPED
    self.last_error = reason
    self.completed_at = datetime.now()
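Note that mark_failed increments retry_count, which a scheduler can use to bound retries when requeueing frontier items. ragcrawl's actual retry policy isn't shown on this page; a hypothetical gate (max_retries and the string statuses are assumptions for illustration) might look like:

```python
def should_retry(status: str, retry_count: int, max_retries: int = 3) -> bool:
    """Requeue a failed frontier item until its retry budget is spent.
    max_retries is a hypothetical knob, not part of the model above."""
    return status == "failed" and retry_count < max_retries
```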