Document

The Document model represents a crawled page's content and metadata, ready for output and downstream processing.

Overview

Document is the primary output model that contains:

  • Cleaned markdown content
  • Page metadata (title, description)
  • Source information (URL, status code)
  • Crawl context (run ID, site ID, timestamps)

Usage

Creating Documents

Documents are typically created by the crawl process, but you can create them manually:

Python
from datetime import datetime, timezone
from ragcrawl.models import Document

doc = Document(
    doc_id="abc123",
    page_id="abc123",
    source_url="https://example.com/guide",
    normalized_url="https://example.com/guide",
    markdown="# User Guide\n\nWelcome to the guide...",
    title="User Guide",
    description="Complete user guide for the product",
    status_code=200,
    content_type="text/html",
    depth=1,
    run_id="run_xyz789",
    site_id="site_abc123",
    first_seen=datetime.now(timezone.utc),
    last_seen=datetime.now(timezone.utc),
    last_crawled=datetime.now(timezone.utc),
)
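Per the API reference below, doc_id is a stable hash of normalized_url, so the same page always maps to the same document across crawls. A sketch of the idea (the SHA-256 choice and 12-character truncation here are illustrative assumptions, not ragcrawl's actual scheme):

```python
import hashlib

def stable_doc_id(normalized_url: str) -> str:
    # Illustrative only: derive a deterministic ID from the normalized URL.
    # ragcrawl's actual hash function and ID length may differ.
    return hashlib.sha256(normalized_url.encode("utf-8")).hexdigest()[:12]
```

Because the ID depends only on the normalized URL, re-crawling the same page produces the same doc_id.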

Accessing Content

Python
# Get the markdown content
print(doc.markdown)

# Get metadata
print(f"Title: {doc.title}")
print(f"Description: {doc.description}")
print(f"Word count: {doc.word_count}")

# Get source info
print(f"URL: {doc.source_url}")
print(f"Status: {doc.status_code}")
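word_count and char_count are populated during the crawl. A plausible reading of how they could be derived from the markdown (an assumption; ragcrawl's exact counting rules are not documented here):

```python
def content_counts(markdown: str) -> tuple[int, int]:
    # Whitespace-split word count and raw character count.
    # Hypothetical helper; ragcrawl's actual tokenization may differ.
    return len(markdown.split()), len(markdown)
```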

Document Properties

Python
# Check if document is valid
if doc.status_code == 200 and not doc.is_tombstone:
    process_document(doc)

# Check content type (content_type is str | None, so guard against None)
if doc.content_type and doc.content_type.startswith("text/html"):
    # Process as HTML-derived content
    pass

Fields

| Field | Type | Description |
|---|---|---|
| doc_id | str | Unique document identifier |
| page_id | str | Associated page ID |
| source_url | str | Original URL |
| normalized_url | str | Normalized URL for deduplication |
| markdown | str | Cleaned markdown content |
| title | str \| None | Page title |
| description | str \| None | Page meta description |
| status_code | int | HTTP status code |
| content_type | str \| None | Content MIME type |
| language | str | Detected language |
| depth | int | Crawl depth from seed |
| word_count | int | Word count |
| char_count | int | Character count |
| outlinks | list[str] | Outbound links |
| run_id | str | Crawl run ID |
| site_id | str | Site ID |
| first_seen | datetime | First crawl time |
| last_seen | datetime | Last seen time |
| last_crawled | datetime | Last crawl time |
| is_tombstone | bool | Deleted page marker |
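depth and outlinks together describe the crawl frontier: depth is the breadth-first distance from the seed, and each document's outlinks feed the next level. A self-contained sketch of that traversal (the function and graph shape here are illustrative, not ragcrawl API):

```python
from collections import deque

def bfs_depths(outlinks: dict[str, list[str]], seed: str, max_depth: int) -> dict[str, int]:
    # Breadth-first walk over an outlink graph, recording each URL's
    # distance from the seed, like the `depth` field on Document.
    depths = {seed: 0}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if depths[url] == max_depth:
            continue  # do not expand links beyond the depth limit
        for link in outlinks.get(url, []):
            if link not in depths:
                depths[link] = depths[url] + 1
                queue.append(link)
    return depths
```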

API Reference

Document

Bases: BaseModel

A crawled document with rich metadata for LLM/RAG consumption.

This is the primary output model containing all extracted content and metadata from a crawled page.

doc_id class-attribute instance-attribute

Python
doc_id: str = Field(
    description="Stable ID: hash(normalized_url)"
)

page_id class-attribute instance-attribute

Python
page_id: str = Field(
    description="Alias for doc_id for compatibility"
)

source_url class-attribute instance-attribute

Python
source_url: str = Field(
    description="Original URL as discovered"
)

normalized_url class-attribute instance-attribute

Python
normalized_url: str = Field(
    description="Normalized/canonical URL"
)

markdown class-attribute instance-attribute

Python
markdown: str = Field(
    description="Extracted clean Markdown content"
)

title class-attribute instance-attribute

Python
title: str | None = Field(
    default=None, description="Page title"
)

description class-attribute instance-attribute

Python
description: str | None = Field(
    default=None, description="Meta description"
)

status_code class-attribute instance-attribute

Python
status_code: int = Field(description="HTTP status code")

content_type class-attribute instance-attribute

Python
content_type: str | None = Field(
    default=None, description="HTTP Content-Type"
)