Page

The Page model tracks the state and freshness of crawled pages.

Overview

Page is an internal model that tracks:

  • Page URL and identification
  • Current content version
  • Freshness information (ETags, Last-Modified)
  • Crawl metadata (depth, status, errors)

Usage

Accessing Pages

Python
from ragcrawl.storage import create_storage_backend

backend = create_storage_backend(storage_config)
backend.initialize()

# Get a specific page
page = backend.get_page(page_id)
print(f"URL: {page.url}")
print(f"Last crawled: {page.last_crawled}")
print(f"Status: {page.status_code}")

# Get page by URL
page = backend.get_page_by_url(site_id, url)

# List all pages for a site
pages = backend.list_pages(site_id)
for page in pages:
    print(f"{page.url}: {page.status_code}")

Page Freshness

Python
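# Check if page needs re-crawling
from datetime import datetime, timezone, timedelta

max_age = timedelta(hours=24)
# A page that has never been crawled has last_crawled = None and always needs a crawl.
# The comparison assumes last_crawled is timezone-aware (UTC); if your backend stores
# naive datetimes, compare against datetime.now() instead, as needs_recrawl does.
if page.last_crawled is None or page.last_crawled < datetime.now(timezone.utc) - max_age:
    print(f"Page {page.url} needs refresh")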

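# Build conditional request headers from the stored validators.
# Start from an empty dict so both headers can be sent together
# and headers is always defined.
headers = {}
if page.etag:
    headers["If-None-Match"] = page.etag
if page.last_modified:
    headers["If-Modified-Since"] = page.last_modified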
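
The header logic above can be wrapped in a small helper and exercised without a storage backend. This is a minimal sketch: `StoredValidators`, `build_conditional_headers`, and `is_unchanged` are hypothetical names introduced for illustration and are not part of ragcrawl. It relies only on standard HTTP behaviour, where a server answers a matching conditional request with 304 Not Modified.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredValidators:
    """Hypothetical stand-in for the etag / last_modified fields of a Page."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def build_conditional_headers(page: StoredValidators) -> dict:
    """Return conditional request headers for whichever validators are stored."""
    headers = {}
    if page.etag:
        headers["If-None-Match"] = page.etag
    if page.last_modified:
        headers["If-Modified-Since"] = page.last_modified
    return headers


def is_unchanged(status_code: int) -> bool:
    """A 304 Not Modified response means the stored copy is still current."""
    return status_code == 304


both = build_conditional_headers(
    StoredValidators(etag='"abc123"', last_modified="Wed, 21 Oct 2015 07:28:00 GMT")
)
print(both)
print(build_conditional_headers(StoredValidators()))  # no validators stored
```

When neither validator is stored, the helper returns an empty dict and the request becomes an ordinary unconditional fetch.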

Tombstone Pages

Python
# Check if page was deleted
if page.is_tombstone:
    print(f"Page {page.url} was removed from site")

# Get pages including tombstones
pages = backend.list_pages(site_id, include_tombstones=True)
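
Once tombstones are included in a listing, it is often useful to separate live pages from removed ones. The sketch below shows one way to do that. `TombstoneStub` and `split_tombstones` are hypothetical names that stand in for real Page objects so the example runs on its own; only the `url` and `is_tombstone` fields from the documentation are assumed.

```python
from dataclasses import dataclass


@dataclass
class TombstoneStub:
    """Hypothetical stand-in exposing only the fields this sketch reads."""
    url: str
    is_tombstone: bool = False


def split_tombstones(pages):
    """Partition pages into (live, removed) by their is_tombstone flag."""
    live = [p for p in pages if not p.is_tombstone]
    removed = [p for p in pages if p.is_tombstone]
    return live, removed


pages = [
    TombstoneStub("https://example.com/"),
    TombstoneStub("https://example.com/old", is_tombstone=True),
    TombstoneStub("https://example.com/docs"),
]
live, removed = split_tombstones(pages)
print("live:", [p.url for p in live])
print("removed:", [p.url for p in removed])
```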

Fields

Field               Type      Description
page_id             str       Unique page identifier
site_id             str       Parent site ID
url                 str       Page URL
canonical_url       str       Canonical URL if different
current_version_id  str       Current content version
content_hash        str       Content hash for change detection
etag                str       HTTP ETag header
last_modified       str       HTTP Last-Modified header
first_seen          datetime  First crawl time
last_seen           datetime  Last seen in crawl
last_crawled        datetime  Last successful crawl
last_changed        datetime  Last content change
depth               int       Link depth from seed
referrer_url        str       Referring page URL
status_code         int       Last HTTP status
is_tombstone        bool      Page was deleted
error_count         int       Consecutive errors
last_error          str       Last error message
version_count       int       Total versions
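
The content_hash field supports change detection by comparing a stored hash with the hash of freshly fetched content. The sketch below illustrates that comparison. How ragcrawl actually computes the hash is not stated here, so SHA-256 over UTF-8 text is an assumption, and `compute_content_hash` and `has_changed` are hypothetical helper names.

```python
import hashlib


def compute_content_hash(text: str) -> str:
    """Hash page text. SHA-256 is an assumption, not necessarily what ragcrawl uses."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def has_changed(stored_hash, new_text: str) -> bool:
    """Treat a page with no stored hash as changed; otherwise compare hashes."""
    return stored_hash is None or stored_hash != compute_content_hash(new_text)


stored = compute_content_hash("Hello, world")
print(has_changed(stored, "Hello, world"))   # same text, same hash
print(has_changed(stored, "Hello, world!"))  # one character differs
print(has_changed(None, "Hello, world"))     # nothing stored yet
```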

API Reference

Page

Bases: BaseModel

Represents the current state of a URL in the crawl database.

This model tracks freshness information and points to the current version of the page content. It's used for incremental sync to determine what needs re-crawling.

needs_recrawl

Python
needs_recrawl(
    max_age_hours: float | None = None, force: bool = False
) -> bool

Determine if this page needs to be re-crawled.

Parameters:

  • max_age_hours (float | None, default: None): Maximum age in hours before recrawl. None means always recrawl.
  • force (bool, default: False): If True, always return True.

Returns:

  • bool: True if the page should be re-crawled.

Source code in src/ragcrawl/models/page.py
Python
def needs_recrawl(
    self,
    max_age_hours: float | None = None,
    force: bool = False,
) -> bool:
    """
    Determine if this page needs to be re-crawled.

    Args:
        max_age_hours: Maximum age in hours before recrawl. None means always recrawl.
        force: If True, always return True.

    Returns:
        True if the page should be re-crawled.
    """
    if force:
        return True

    if self.is_tombstone:
        return False

    if self.last_crawled is None:
        return True

    if max_age_hours is None:
        return True

    age = datetime.now() - self.last_crawled
    return age.total_seconds() / 3600 > max_age_hours
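
The order of the checks in the listing above matters: force wins over everything, a tombstone is skipped unless forced, and the age comparison only runs when both last_crawled and max_age_hours are set. The sketch below reproduces that order on a hypothetical `RecrawlStub` class so each case can be run without the library. It is an illustration of the logic shown above, not the real Page model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RecrawlStub:
    """Hypothetical stand-in with the two fields needs_recrawl reads."""
    last_crawled: Optional[datetime] = None
    is_tombstone: bool = False

    def needs_recrawl(
        self, max_age_hours: Optional[float] = None, force: bool = False
    ) -> bool:
        # Same decision order as the library listing.
        if force:
            return True
        if self.is_tombstone:
            return False
        if self.last_crawled is None or max_age_hours is None:
            return True
        age = datetime.now() - self.last_crawled
        return age.total_seconds() / 3600 > max_age_hours


fresh = RecrawlStub(last_crawled=datetime.now() - timedelta(hours=1))
stale = RecrawlStub(last_crawled=datetime.now() - timedelta(hours=48))
deleted = RecrawlStub(last_crawled=datetime.now() - timedelta(hours=48), is_tombstone=True)
never_crawled = RecrawlStub()

results = {
    "fresh, 24h limit": fresh.needs_recrawl(max_age_hours=24),
    "stale, 24h limit": stale.needs_recrawl(max_age_hours=24),
    "fresh, no limit": fresh.needs_recrawl(),
    "deleted, 24h limit": deleted.needs_recrawl(max_age_hours=24),
    "deleted, forced": deleted.needs_recrawl(max_age_hours=24, force=True),
    "never crawled": never_crawled.needs_recrawl(max_age_hours=24),
}
for case, answer in results.items():
    print(f"{case}: {answer}")
```

Note that the listing compares against a naive datetime.now(), so the stub uses naive timestamps as well; mixing naive and timezone-aware datetimes in the subtraction would raise a TypeError.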