Document

The Document model represents a crawled page's content and metadata, ready for output and downstream processing.

Overview

Document is the primary output model that contains:

  • Cleaned markdown content
  • Page metadata (title, description)
  • Source information (URL, status code)
  • Crawl context (run ID, site ID, timestamps)

Usage

Creating Documents

Documents are typically created by the crawl process, but you can create them manually:

Python
from datetime import datetime, timezone
from ragcrawl.models import Document

doc = Document(
    doc_id="abc123",
    page_id="abc123",
    source_url="https://example.com/guide",
    normalized_url="https://example.com/guide",
    markdown="# User Guide\n\nWelcome to the guide...",
    title="User Guide",
    description="Complete user guide for the product",
    status_code=200,
    content_type="text/html",
    depth=1,
    run_id="run_xyz789",
    site_id="site_abc123",
    first_seen=datetime.now(timezone.utc),
    last_seen=datetime.now(timezone.utc),
    last_crawled=datetime.now(timezone.utc),
)
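Per the API reference below, doc_id is a stable hash of normalized_url, so the same page always maps to the same document across crawls. A sketch of the idea (the SHA-256 choice and 12-character truncation here are illustrative assumptions, not ragcrawl's actual scheme):

```python
import hashlib

def stable_doc_id(normalized_url: str) -> str:
    # Illustrative only: derive a deterministic ID from the normalized URL.
    # ragcrawl's actual hash function and ID length may differ.
    return hashlib.sha256(normalized_url.encode("utf-8")).hexdigest()[:12]
```

Because the ID depends only on the normalized URL, re-crawling the same page produces the same doc_id.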

Accessing Content

Python
# Get the markdown content
print(doc.markdown)

# Get metadata
print(f"Title: {doc.title}")
print(f"Description: {doc.description}")
print(f"Word count: {doc.word_count}")

# Get source info
print(f"URL: {doc.source_url}")
print(f"Status: {doc.status_code}")
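word_count and char_count are populated during the crawl. A plausible reading of how they could be derived from the markdown (an assumption; ragcrawl's exact counting rules are not documented here):

```python
def content_counts(markdown: str) -> tuple[int, int]:
    # Whitespace-split word count and raw character count.
    # Hypothetical helper; ragcrawl's actual tokenization may differ.
    return len(markdown.split()), len(markdown)
```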

Document Properties

Python
# Check if document is valid
if doc.status_code == 200 and not doc.is_tombstone:
    process_document(doc)

# Check content type (content_type is str | None, so guard against None)
if doc.content_type and doc.content_type.startswith("text/html"):
    # Process as HTML-derived content
    pass

Fields

| Field | Type | Description |
|---|---|---|
| doc_id | str | Unique document identifier |
| page_id | str | Associated page ID |
| source_url | str | Original URL |
| normalized_url | str | Normalized URL for deduplication |
| markdown | str | Cleaned markdown content |
| title | str \| None | Page title |
| description | str \| None | Page meta description |
| status_code | int | HTTP status code |
| content_type | str \| None | Content MIME type |
| language | str | Detected language |
| depth | int | Crawl depth from seed |
| word_count | int | Word count |
| char_count | int | Character count |
| outlinks | list[str] | Outbound links |
| run_id | str | Crawl run ID |
| site_id | str | Site ID |
| first_seen | datetime | First crawl time |
| last_seen | datetime | Last seen time |
| last_crawled | datetime | Last crawl time |
| is_tombstone | bool | Deleted page marker |
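depth and outlinks together describe the crawl frontier: depth is the breadth-first distance from the seed, and each document's outlinks feed the next level. A self-contained sketch of that traversal (the function and graph shape here are illustrative, not ragcrawl API):

```python
from collections import deque

def bfs_depths(outlinks: dict[str, list[str]], seed: str, max_depth: int) -> dict[str, int]:
    # Breadth-first walk over an outlink graph, recording each URL's
    # distance from the seed, like the `depth` field on Document.
    depths = {seed: 0}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if depths[url] == max_depth:
            continue  # do not expand links beyond the depth limit
        for link in outlinks.get(url, []):
            if link not in depths:
                depths[link] = depths[url] + 1
                queue.append(link)
    return depths
```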

API Reference

Document

Bases: BaseModel

A crawled document with rich metadata for LLM/RAG consumption.

This is the primary output model containing all extracted content and metadata from a crawled page.

doc_id class-attribute instance-attribute

Python
doc_id: str = Field(
    description="Stable ID: hash(normalized_url)"
)

page_id class-attribute instance-attribute

Python
page_id: str = Field(
    description="Alias for doc_id for compatibility"
)

source_url class-attribute instance-attribute

Python
source_url: str = Field(
    description="Original URL as discovered"
)

normalized_url class-attribute instance-attribute

Python
normalized_url: str = Field(
    description="Normalized/canonical URL"
)

markdown class-attribute instance-attribute

Python
markdown: str = Field(
    description="Extracted clean Markdown content"
)

title class-attribute instance-attribute

Python
title: str | None = Field(
    default=None, description="Page title"
)

description class-attribute instance-attribute

Python
description: str | None = Field(
    default=None, description="Meta description"
)

status_code class-attribute instance-attribute

Python
status_code: int = Field(description="HTTP status code")

content_type class-attribute instance-attribute

Python
content_type: str | None = Field(
    default=None, description="HTTP Content-Type"
)