Page¶

The Page model tracks the state and freshness of crawled pages.

Overview¶

Page is an internal model that tracks:

Page URL and identification
Current content version
Freshness information (ETags, Last-Modified)
Crawl metadata (depth, status, errors)

Usage¶

Accessing Pages¶

Python

from ragcrawl.storage import create_storage_backend

backend = create_storage_backend(storage_config)
backend.initialize()

# Get a specific page
page = backend.get_page(page_id)
print(f"URL: {page.url}")
print(f"Last crawled: {page.last_crawled}")
print(f"Status: {page.status_code}")

# Get page by URL
page = backend.get_page_by_url(site_id, url)

# List all pages for a site
pages = backend.list_pages(site_id)
for page in pages:
    print(f"{page.url}: {page.status_code}")

Page Freshness¶

Python

# Check if page needs re-crawling
from datetime import datetime, timezone, timedelta

max_age = timedelta(hours=24)
if page.last_crawled < datetime.now(timezone.utc) - max_age:
    print(f"Page {page.url} needs refresh")

# Check using conditional request headers
if page.etag:
    headers = {"If-None-Match": page.etag}
if page.last_modified:
    headers = {"If-Modified-Since": page.last_modified}

Tombstone Pages¶

Python

# Check if page was deleted
if page.is_tombstone:
    print(f"Page {page.url} was removed from site")

# Get pages including tombstones
pages = backend.list_pages(site_id, include_tombstones=True)

Fields¶

Field	Type	Description
`page_id`	str	Unique page identifier
`site_id`	str	Parent site ID
`url`	str	Page URL
`canonical_url`	str	Canonical URL if different
`current_version_id`	str	Current content version
`content_hash`	str	Content hash for change detection
`etag`	str	HTTP ETag header
`last_modified`	str	HTTP Last-Modified header
`first_seen`	datetime	First crawl time
`last_seen`	datetime	Last seen in crawl
`last_crawled`	datetime	Last successful crawl
`last_changed`	datetime	Last content change
`depth`	int	Link depth from seed
`referrer_url`	str	Referring page URL
`status_code`	int	Last HTTP status
`is_tombstone`	bool	Page was deleted
`error_count`	int	Consecutive errors
`last_error`	str	Last error message
`version_count`	int	Total versions

API Reference¶

Page ¶

Bases: BaseModel

Represents the current state of a URL in the crawl database.

This model tracks freshness information and points to the current version of the page content. It's used for incremental sync to determine what needs re-crawling.

needs_recrawl ¶

Python

needs_recrawl(
    max_age_hours: float | None = None, force: bool = False
) -> bool

Determine if this page needs to be re-crawled.

PARAMETER	DESCRIPTION
`max_age_hours`	Maximum age in hours before recrawl. None means always recrawl. TYPE: `float \| None` DEFAULT: `None`
`force`	If True, always return True. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`bool`	True if the page should be re-crawled.

Source code in src/ragcrawl/models/page.py

Python
def needs_recrawl(
    self,
    max_age_hours: float | None = None,
    force: bool = False,
) -> bool:
    """
    Determine if this page needs to be re-crawled.

    Args:
        max_age_hours: Maximum age in hours before recrawl. None means always recrawl.
        force: If True, always return True.

    Returns:
        True if the page should be re-crawled.
    """
    if force:
        return True

    if self.is_tombstone:
        return False

    if self.last_crawled is None:
        return True

    if max_age_hours is None:
        return True

    age = datetime.now() - self.last_crawled
    return age.total_seconds() / 3600 > max_age_hours