Models¶
Data models used throughout ragcrawl.
Overview¶
ragcrawl uses Pydantic models for type-safe data handling:
| Model | Description |
|---|---|
| Document | Crawled page content for output |
| Site | Crawl configuration and metadata |
| Page | Page state and freshness tracking |
| Chunk | Content chunks for embeddings |
| PageVersion | Versioned content snapshots |
| CrawlRun | Crawl execution records |
| FrontierItem | URL queue items |
Model Relationships¶
```mermaid
erDiagram
    Site ||--o{ CrawlRun : has
    Site ||--o{ Page : contains
    Page ||--o{ PageVersion : versions
    CrawlRun ||--o{ PageVersion : creates
    CrawlRun ||--o{ FrontierItem : manages
    PageVersion ||--|| Document : exports_to
    Document ||--o{ Chunk : chunks_into
```
Quick Reference¶
Document¶
The output model for crawled content:
```python
from datetime import datetime

from ragcrawl.models import Document

doc = Document(
    doc_id="abc123",
    page_id="abc123",
    source_url="https://example.com/page",
    normalized_url="https://example.com/page",
    markdown="# Page Title\n\nContent here...",
    title="Page Title",
    status_code=200,
    content_type="text/html",
    depth=1,
    run_id="run_xyz",
    site_id="site_abc",
    first_seen=datetime.now(),
    last_seen=datetime.now(),
    last_crawled=datetime.now(),
)
```
Site¶
Configuration snapshot for a crawled site:
```python
from datetime import datetime

from ragcrawl.models import Site

site = Site(
    site_id="site_abc123",
    name="Example Docs",
    seeds=["https://docs.example.com"],
    allowed_domains=["docs.example.com"],
    created_at=datetime.now(),
    updated_at=datetime.now(),
)
```
Chunk¶
Content chunk for embeddings:
```python
from ragcrawl.models import Chunk

chunk = Chunk(
    chunk_id="chunk_001",
    doc_id="abc123",
    content="This is the chunk content...",
    chunk_index=0,
    total_chunks=5,
    section_path=["Introduction", "Overview"],
    heading="Overview",
    token_estimate=150,
)
```
Module Reference¶
Data models for ragcrawl.
Document¶
Bases: BaseModel
A crawled document with rich metadata for LLM/RAG consumption.
This is the primary output model containing all extracted content and metadata from a crawled page.
Site¶
Bases: BaseModel
Represents a website/crawl target with its configuration.
Stores the configuration snapshot and metadata for a crawl target.
Page¶
Bases: BaseModel
Represents the current state of a URL in the crawl database.
This model tracks freshness information and points to the current version of the page content. It's used for incremental sync to determine what needs re-crawling.
needs_recrawl¶
Determine if this page needs to be re-crawled.
| PARAMETER | DESCRIPTION |
|---|---|
| max_age_hours | Maximum age in hours before recrawl. None means always recrawl. |
| force | If True, always return True. |

| RETURNS | DESCRIPTION |
|---|---|
| bool | True if the page should be re-crawled. |
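The semantics described above can be sketched as a standalone function. This is an illustrative reimplementation of the documented behavior, not ragcrawl's source; the real method lives on Page and reads its own fields:

```python
from datetime import datetime, timedelta, timezone

def needs_recrawl(last_crawled, max_age_hours=None, force=False):
    """Illustrative freshness check mirroring the documented semantics."""
    if force:
        # force=True always wins.
        return True
    if max_age_hours is None:
        # None means always recrawl.
        return True
    age = datetime.now(timezone.utc) - last_crawled
    return age > timedelta(hours=max_age_hours)
```

This is the core of incremental sync: pages crawled within the freshness window are skipped, everything else is re-fetched.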
Source code in src/ragcrawl/models/page.py
PageVersion¶
Bases: BaseModel
Represents a specific version of a page's content.
Each time content changes, a new PageVersion is created. This enables version history and change tracking for KB updates.
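A common way to decide whether a new version is needed is to fingerprint the extracted content and compare it against the stored version. A minimal sketch, assuming a content-hash scheme (the hashing mechanism here is an assumption for illustration, not ragcrawl's actual change-detection logic):

```python
import hashlib

def content_fingerprint(markdown: str) -> str:
    """Stable hash of page content, used to detect changes between crawls."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()

def needs_new_version(old_fingerprint: str, markdown: str) -> bool:
    """True when the freshly crawled content differs from the stored version."""
    return content_fingerprint(markdown) != old_fingerprint
```

When the fingerprint is unchanged, the existing PageVersion can be kept and only freshness timestamps updated.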
Chunk¶
Bases: BaseModel
Represents a chunk of content for RAG/embedding pipelines.
Chunks are segments of a document optimized for vector embedding and retrieval, with metadata for context reconstruction.
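The section_path and heading fields make it possible to rebuild reading context when a chunk is retrieved in isolation. A minimal sketch, using the field values from the Quick Reference example above; the breadcrumb prefixing scheme is illustrative, not part of ragcrawl:

```python
def chunk_with_context(content, section_path, heading):
    """Prefix a chunk with its breadcrumb trail before embedding or display."""
    trail = " > ".join(section_path)
    return f"[{trail}]\n## {heading}\n{content}"

text = chunk_with_context(
    "This is the chunk content...",
    ["Introduction", "Overview"],
    "Overview",
)
```

Embedding the contextualized text rather than the bare content helps retrieval when many chunks share similar wording but belong to different sections.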