# API Reference
Complete Python API documentation for ragcrawl.
## Overview
ragcrawl is organized into the following modules:
| Module | Description |
|---|---|
| Core | Main entry points (CrawlJob, SyncJob) |
| Models | Data models (Document, Page, Chunk, etc.) |
| Storage | Storage backends (DuckDB, DynamoDB) |
| Chunking | Content chunking (HeadingChunker, TokenChunker) |
| Export | Export and publishing (JSON, Markdown) |
| Filters | URL filtering and normalization |
## Quick Reference
### Crawling
```python
import asyncio

from ragcrawl import CrawlJob
from ragcrawl.config import CrawlerConfig

config = CrawlerConfig(
    seeds=["https://docs.example.com"],
    max_pages=100,
    max_depth=5,
)

job = CrawlJob(config)
result = asyncio.run(job.run())
print(f"Crawled {result.stats.pages_crawled} pages")
```
### Syncing
```python
import asyncio

from ragcrawl import SyncJob
from ragcrawl.config import SyncConfig

config = SyncConfig(
    site_id="site_abc123",
    use_sitemap=True,
)

job = SyncJob(config)
result = asyncio.run(job.run())
print(f"Updated {result.stats.pages_changed} pages")
```
### Chunking
```python
from ragcrawl.chunking import HeadingChunker

chunker = HeadingChunker(max_tokens=500)
chunks = chunker.chunk_documents(result.documents)
```
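For intuition, the heading-based strategy can be illustrated with a standalone sketch that splits Markdown at headings and caps each section by a rough whitespace-token count. This is a simplified illustration of the idea, not ragcrawl's actual implementation:

```python
import re

def chunk_by_headings(markdown: str, max_tokens: int = 500) -> list[str]:
    """Split Markdown at heading lines, then cap each section by a
    rough whitespace-token count. Illustrative sketch only."""
    # Split on a lookahead so each heading line stays with its section.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        words = section.split()
        if not words:
            continue
        # Emit max_tokens-sized slices of each section.
        for i in range(0, len(words), max_tokens):
            chunks.append(" ".join(words[i : i + max_tokens]))
    return chunks

doc = "# Intro\nHello world.\n## Usage\nCall the API."
print(chunk_by_headings(doc))  # two chunks, one per heading
```

Splitting at headings first keeps each chunk semantically coherent; the token cap only kicks in when a single section is too long on its own.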
### Exporting
```python
from pathlib import Path

from ragcrawl.export import JSONLExporter

exporter = JSONLExporter()
exporter.export_documents(result.documents, Path("output.jsonl"))
```
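`JSONLExporter` writes JSON Lines: one JSON object per line. For reference, the format itself can be produced with the standard library alone (the field names below are illustrative, not ragcrawl's schema):

```python
import json
from pathlib import Path

# Each document becomes one JSON object on its own line.
docs = [
    {"url": "https://docs.example.com/a", "text": "Page A"},
    {"url": "https://docs.example.com/b", "text": "Page B"},
]

path = Path("output.jsonl")
with path.open("w", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")
```

One object per line makes the file streamable: downstream RAG pipelines can ingest it record by record without parsing the whole file.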
## Module Documentation
- Core: `CrawlJob` and `SyncJob`, the main entry points for crawling operations
- Data models: `Document`, `Page`, `PageVersion`, `Chunk`, `Site`, `CrawlRun`
- Storage backends: `DuckDBBackend`, `DynamoDBBackend`
- Content chunkers: `HeadingChunker`, `TokenChunker`
- Exporters and publishers: `JSONExporter`, `SinglePagePublisher`, `MultiPagePublisher`
- URL filtering: `LinkFilter`, `PatternMatcher`, `URLNormalizer`
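To illustrate what `URLNormalizer`-style normalization typically involves (lowercasing the scheme and host, dropping fragments, collapsing trailing slashes), here is a standalone sketch using only the standard library; it is not ragcrawl's actual implementation:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Canonicalize a URL so duplicate pages compare equal.
    Illustrative sketch only."""
    parts = urlsplit(url)
    # Treat /docs and /docs/ as the same page.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        path,
        parts.query,
        "",  # drop the #fragment, which never changes server content
    ))

print(normalize_url("HTTPS://Docs.Example.com/Guide/#intro"))
```

Normalizing before deduplication prevents a crawler from fetching the same page several times under cosmetically different URLs.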
## Configuration Classes
| Class | Module | Description |
|---|---|---|
| `CrawlerConfig` | `ragcrawl.config` | Crawl settings |
| `SyncConfig` | `ragcrawl.config` | Sync settings |
| `StorageConfig` | `ragcrawl.config` | Storage backend config |
| `OutputConfig` | `ragcrawl.config` | Output format config |
| `DuckDBConfig` | `ragcrawl.config.storage_config` | DuckDB settings |
| `DynamoDBConfig` | `ragcrawl.config.storage_config` | DynamoDB settings |
## Imports
```python
# Main classes
from ragcrawl import CrawlJob, SyncJob
from ragcrawl.config import CrawlerConfig, SyncConfig

# Storage
from ragcrawl.config.storage_config import StorageConfig, DuckDBConfig, DynamoDBConfig
from ragcrawl.storage import create_storage_backend

# Models
from ragcrawl.models import Document, Page, PageVersion, Chunk, Site, CrawlRun

# Chunking
from ragcrawl.chunking import HeadingChunker, TokenChunker

# Export
from ragcrawl.export import JSONExporter, JSONLExporter
from ragcrawl.output import SinglePagePublisher, MultiPagePublisher

# Filters
from ragcrawl.filters import LinkFilter, PatternMatcher, URLNormalizer
```
## Type Hints
ragcrawl is fully typed. Enable type checking in your IDE:
```python
from ragcrawl import CrawlJob
from ragcrawl.config import CrawlerConfig

config: CrawlerConfig = CrawlerConfig(...)
job: CrawlJob = CrawlJob(config)
```
## Async API
Core operations are async:
```python
import asyncio

async def main():
    job = CrawlJob(config)
    result = await job.run()
    return result

result = asyncio.run(main())
```
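Because `run()` is a coroutine, several jobs can also execute concurrently with `asyncio.gather`. The pattern looks like this, shown with placeholder coroutines so the snippet stands alone; with ragcrawl you would gather the `job.run()` calls instead:

```python
import asyncio

# Stand-ins for job.run(); each simulates an independent async crawl.
async def fake_run(name: str) -> str:
    await asyncio.sleep(0)  # yield control, as real network I/O would
    return f"{name} done"

async def main() -> list[str]:
    # gather() runs the coroutines concurrently and preserves order.
    return await asyncio.gather(fake_run("site-a"), fake_run("site-b"))

results = asyncio.run(main())
print(results)  # ['site-a done', 'site-b done']
```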
Or use the sync wrapper: