# API Reference
Complete Python API documentation for ragcrawl.
## Overview
ragcrawl is organized into the following modules:
| Module | Description |
|---|---|
| Core | Main entry points (CrawlJob, SyncJob) |
| Models | Data models (Document, Page, Chunk, etc.) |
| Storage | Storage backends (DuckDB, DynamoDB) |
| Chunking | Content chunking (HeadingChunker, TokenChunker) |
| Export | Export and publishing (JSON, Markdown) |
| Filters | URL filtering and normalization |
## Quick Reference
### Crawling
```python
import asyncio

from ragcrawl import CrawlJob
from ragcrawl.config import CrawlerConfig

config = CrawlerConfig(
    seeds=["https://docs.example.com"],
    max_pages=100,
    max_depth=5,
)

job = CrawlJob(config)
result = asyncio.run(job.run())
print(f"Crawled {result.stats.pages_crawled} pages")
```
### Syncing
```python
import asyncio

from ragcrawl import SyncJob
from ragcrawl.config import SyncConfig

config = SyncConfig(
    site_id="site_abc123",
    use_sitemap=True,
)

job = SyncJob(config)
result = asyncio.run(job.run())
print(f"Updated {result.stats.pages_changed} pages")
```
### Chunking
```python
from ragcrawl.chunking import HeadingChunker

chunker = HeadingChunker(max_tokens=500)
chunks = chunker.chunk_documents(result.documents)
```
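For intuition, the heading-based strategy can be illustrated with a standalone sketch that splits Markdown at headings and caps each section by a rough whitespace-token count. This is a simplified illustration of the idea, not ragcrawl's actual implementation:

```python
import re

def chunk_by_headings(markdown: str, max_tokens: int = 500) -> list[str]:
    """Split Markdown at heading lines, then cap each section by a
    rough whitespace-token count. Illustrative sketch only."""
    # Split on a lookahead so each heading line stays with its section.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        words = section.split()
        if not words:
            continue
        # Emit max_tokens-sized slices of each section.
        for i in range(0, len(words), max_tokens):
            chunks.append(" ".join(words[i : i + max_tokens]))
    return chunks

doc = "# Intro\nHello world.\n## Usage\nCall the API."
print(chunk_by_headings(doc))  # two chunks, one per heading
```

Splitting at headings first keeps each chunk semantically coherent; the token cap only kicks in when a single section is too long on its own.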
### Exporting
```python
from pathlib import Path

from ragcrawl.export import JSONLExporter

exporter = JSONLExporter()
exporter.export_documents(result.documents, Path("output.jsonl"))
```
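`JSONLExporter` writes JSON Lines: one JSON object per line. For reference, the format itself can be produced with the standard library alone (the field names below are illustrative, not ragcrawl's schema):

```python
import json
from pathlib import Path

# Each document becomes one JSON object on its own line.
docs = [
    {"url": "https://docs.example.com/a", "text": "Page A"},
    {"url": "https://docs.example.com/b", "text": "Page B"},
]

path = Path("output.jsonl")
with path.open("w", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")
```

One object per line makes the file streamable: downstream RAG pipelines can ingest it record by record without parsing the whole file.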
## Module Documentation
- Core: `CrawlJob` and `SyncJob`, the main entry points for crawling operations
- Data models: `Document`, `Page`, `PageVersion`, `Chunk`, `Site`, `CrawlRun`
- Storage backends: `DuckDBBackend`, `DynamoDBBackend`
- Content chunkers: `HeadingChunker`, `TokenChunker`
- Exporters and publishers: `JSONExporter`, `SinglePagePublisher`, `MultiPagePublisher`
- URL filtering: `LinkFilter`, `PatternMatcher`, `URLNormalizer`
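To illustrate what `URLNormalizer`-style normalization typically involves (lowercasing the scheme and host, dropping fragments, collapsing trailing slashes), here is a standalone sketch using only the standard library; it is not ragcrawl's actual implementation:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Canonicalize a URL so duplicate pages compare equal.
    Illustrative sketch only."""
    parts = urlsplit(url)
    # Treat /docs and /docs/ as the same page.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        path,
        parts.query,
        "",  # drop the #fragment, which never changes server content
    ))

print(normalize_url("HTTPS://Docs.Example.com/Guide/#intro"))
```

Normalizing before deduplication prevents a crawler from fetching the same page several times under cosmetically different URLs.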
## Configuration Classes
| Class | Module | Description |
|---|---|---|
| `CrawlerConfig` | `ragcrawl.config` | Crawl settings |
| `SyncConfig` | `ragcrawl.config` | Sync settings |
| `StorageConfig` | `ragcrawl.config` | Storage backend config |
| `OutputConfig` | `ragcrawl.config` | Output format config |
| `DuckDBConfig` | `ragcrawl.config.storage_config` | DuckDB settings |
| `DynamoDBConfig` | `ragcrawl.config.storage_config` | DynamoDB settings |
## Imports
```python
# Main classes
from ragcrawl import CrawlJob, SyncJob
from ragcrawl.config import CrawlerConfig, SyncConfig

# Storage
from ragcrawl.config.storage_config import StorageConfig, DuckDBConfig, DynamoDBConfig
from ragcrawl.storage import create_storage_backend

# Models
from ragcrawl.models import Document, Page, PageVersion, Chunk, Site, CrawlRun

# Chunking
from ragcrawl.chunking import HeadingChunker, TokenChunker

# Export
from ragcrawl.export import JSONExporter, JSONLExporter
from ragcrawl.output import SinglePagePublisher, MultiPagePublisher

# Filters
from ragcrawl.filters import LinkFilter, PatternMatcher, URLNormalizer
```
## Type Hints
ragcrawl is fully typed. Enable type checking in your IDE:
```python
from ragcrawl import CrawlJob
from ragcrawl.config import CrawlerConfig

config: CrawlerConfig = CrawlerConfig(...)
job: CrawlJob = CrawlJob(config)
```
## Async API
Core operations are async:
```python
import asyncio

async def main():
    job = CrawlJob(config)
    result = await job.run()
    return result

result = asyncio.run(main())
```
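Because `run()` is a coroutine, several jobs can also execute concurrently with `asyncio.gather`. The pattern looks like this, shown with placeholder coroutines so the snippet stands alone; with ragcrawl you would gather the `job.run()` calls instead:

```python
import asyncio

# Stand-ins for job.run(); each simulates an independent async crawl.
async def fake_run(name: str) -> str:
    await asyncio.sleep(0)  # yield control, as real network I/O would
    return f"{name} done"

async def main() -> list[str]:
    # gather() runs the coroutines concurrently and preserves order.
    return await asyncio.gather(fake_run("site-a"), fake_run("site-b"))

results = asyncio.run(main())
print(results)  # ['site-a done', 'site-b done']
```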
Or use the sync wrapper: