Export API¶
ragcrawl provides exporters and publishers for outputting crawled content.
Overview¶
Exporters¶
Export data to structured formats:
| Exporter | Format | Description |
|---|---|---|
| JSONExporter | .json | Single JSON array |
| JSONLExporter | .jsonl | JSON Lines format |
Publishers¶
Output markdown files:
| Publisher | Output | Description |
|---|---|---|
| SinglePagePublisher | One file | Combined markdown |
| MultiPagePublisher | Directory | Preserves structure |
JSONExporter¶
Export documents to a single JSON file.
from pathlib import Path

from ragcrawl.export import JSONExporter

exporter = JSONExporter(indent=2)

# Export documents (`documents` is the list of Document objects from a crawl)
exporter.export_documents(documents, Path("output.json"))

# Export chunks (`chunks` is a list of Chunk objects)
exporter.export_chunks(chunks, Path("chunks.json"))
Output Format¶
[
  {
    "doc_id": "abc123",
    "source_url": "https://example.com/page",
    "title": "Page Title",
    "markdown": "# Page Title\n\nContent...",
    "status_code": 200,
    "word_count": 500
  }
]
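Because the export is a plain JSON array, it can be read back with the standard library alone. The sketch below assumes the field names shown in the sample above (`source_url`, `status_code`, `word_count`). The `summarize_export` helper is hypothetical and not part of ragcrawl.

```python
import json
from pathlib import Path


def summarize_export(path: Path) -> dict:
    """Load a JSON array export and return simple aggregate statistics."""
    # The whole array is parsed into memory at once.
    documents = json.loads(path.read_text(encoding="utf-8"))
    return {
        "documents": len(documents),
        "total_words": sum(doc.get("word_count", 0) for doc in documents),
        # Assumption: a missing status_code is treated as a successful fetch.
        "failed_urls": [
            doc["source_url"]
            for doc in documents
            if doc.get("status_code", 200) >= 400
        ],
    }
```

Calling `summarize_export(Path("output.json"))` on the file written by `export_documents` would give a quick sanity check of a crawl before further processing.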
JSONLExporter¶
Export documents to JSON Lines format (one JSON object per line).
from pathlib import Path

from ragcrawl.export import JSONLExporter

exporter = JSONLExporter()

# Export documents (`documents` is the list of Document objects from a crawl)
exporter.export_documents(documents, Path("output.jsonl"))
Output Format¶
{"doc_id":"abc123","source_url":"https://example.com/page1",...}
{"doc_id":"def456","source_url":"https://example.com/page2",...}
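With one object per line, a JSONL export can be consumed record by record without loading the whole file. A minimal sketch using only the standard library follows. The `iter_jsonl` helper is hypothetical and not part of ragcrawl.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)
```

Iterating with `for doc in iter_jsonl(Path("output.jsonl"))` keeps only one record in memory at a time, which is the main reason to prefer this format for large crawls.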
SinglePagePublisher¶
Combine all documents into a single markdown file.
from ragcrawl.output import SinglePagePublisher
from ragcrawl.config import OutputConfig, OutputMode
config = OutputConfig(
    mode=OutputMode.SINGLE,
    root_dir="./output",
    single_file_name="knowledge_base.md",
    include_toc=True,
    include_metadata=True,
)
publisher = SinglePagePublisher(config)
files = publisher.publish(documents)
print(f"Created: {files[0]}")
Configuration¶
| Option | Type | Default | Description |
|---|---|---|---|
| single_file_name | str | "output.md" | Output filename |
| include_toc | bool | True | Add table of contents |
| include_metadata | bool | True | Add source URLs |
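To illustrate what `include_toc` implies, here is a rough sketch of building a markdown table of contents from page titles. This page does not document ragcrawl's actual anchor scheme, so `slugify` and `build_toc` are hypothetical helpers that show the general idea only.

```python
import re


def slugify(title: str) -> str:
    """Turn a title into a lowercase, hyphen-separated anchor."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")


def build_toc(titles: list[str]) -> str:
    """Return a markdown bullet list linking to one anchor per title."""
    lines = [f"- [{title}](#{slugify(title)})" for title in titles]
    return "\n".join(lines)
```

Each entry links to an in-page anchor, so a reader of the combined file can jump straight to a given page's section.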
MultiPagePublisher¶
Output documents preserving URL structure.
from ragcrawl.output import MultiPagePublisher
from ragcrawl.config import OutputConfig, OutputMode
config = OutputConfig(
    mode=OutputMode.MULTI,
    root_dir="./output",
    rewrite_links=True,
    generate_index=True,
    include_metadata=True,
)
publisher = MultiPagePublisher(config)
files = publisher.publish(documents)
print(f"Created {len(files)} files")
Output Structure¶
output/
├── index.md
├── example.com/
│ ├── docs/
│ │ ├── getting-started.md
│ │ └── api-reference.md
│ └── blog/
│ └── post-1.md
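The tree above suggests that each page lands at `<host>/<url path>.md` under the output root. A minimal sketch of such a mapping follows. `url_to_relative_path` is a hypothetical helper, and the library's real logic may treat query strings, file extensions, or name collisions differently.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_relative_path(url: str) -> PurePosixPath:
    """Map a page URL to a host/path.md location under the output root."""
    parsed = urlparse(url)
    # Drop empty segments caused by leading, trailing, or doubled slashes.
    segments = [part for part in parsed.path.split("/") if part]
    if not segments:
        segments = ["index"]  # assumption: the site root becomes index.md
    # Append ".md" rather than using with_suffix() so dotted names survive.
    segments[-1] += ".md"
    return PurePosixPath(parsed.netloc, *segments)
```

Under these assumptions, `https://example.com/docs/getting-started` maps to `example.com/docs/getting-started.md`, matching the layout shown in the tree.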
Configuration¶
| Option | Type | Default | Description |
|---|---|---|---|
| root_dir | str | "./output" | Output directory |
| rewrite_links | bool | True | Rewrite internal links |
| generate_index | bool | True | Create index file |
| include_metadata | bool | True | Add frontmatter |
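As an illustration of what `rewrite_links` does conceptually, the sketch below replaces the targets of inline markdown links that appear in a URL-to-file mapping and leaves all other links untouched. `rewrite_links` here is a hypothetical standalone function, not the publisher's implementation, and the regular expression covers only simple `[text](target)` links.

```python
import re

# Matches inline markdown links of the form [text](target).
LINK_PATTERN = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")


def rewrite_links(markdown: str, url_to_file: dict[str, str]) -> str:
    """Replace link targets found in url_to_file; leave all others alone."""

    def replace(match: re.Match) -> str:
        text, target = match.group(1), match.group(2)
        return f"[{text}]({url_to_file.get(target, target)})"

    return LINK_PATTERN.sub(replace, markdown)
```

Links to pages outside the crawl are absent from the mapping, so they keep pointing at their original URLs.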
Output Configuration¶
from ragcrawl.config import OutputConfig, OutputMode
config = OutputConfig(
    mode=OutputMode.MULTI,  # or OutputMode.SINGLE
    root_dir="./output",
    # Single-page options
    single_file_name="combined.md",
    include_toc=True,
    # Multi-page options
    rewrite_links=True,
    generate_index=True,
    # Common options
    include_metadata=True,
)
Module Reference¶
Export functionality for ragcrawl.
JSONExporter¶
JSONExporter(
    indent: int | None = 2,
    include_html: bool = False,
    include_diagnostics: bool = True,
)
Bases: Exporter
Exports documents and chunks as JSON.
Initialize JSON exporter.
| PARAMETER | DESCRIPTION |
|---|---|
| indent | JSON indentation (None for compact). TYPE: int \| None |
| include_html | Include HTML content in export. TYPE: bool |
| include_diagnostics | Include diagnostic info. TYPE: bool |
Source code in src/ragcrawl/export/json_exporter.py
export_document¶
export_document(
    document: Document, path: Path | None = None
) -> str | None
Export a document as JSON.
Source code in src/ragcrawl/export/json_exporter.py
export_documents¶
export_documents(
    documents: list[Document], path: Path
) -> None
Export documents as JSON array.
Source code in src/ragcrawl/export/json_exporter.py
export_chunk¶
export_chunk(
    chunk: Chunk, path: Path | None = None
) -> str | None
Export a chunk as JSON.
Source code in src/ragcrawl/export/json_exporter.py
export_chunks¶
export_chunks(chunks: list[Chunk], path: Path) -> None
Export chunks as JSON array.
Source code in src/ragcrawl/export/json_exporter.py
JSONLExporter¶
Bases: Exporter
Exports documents and chunks as JSONL (one JSON object per line).
JSONL is better for streaming and large datasets.
Initialize JSONL exporter.
| PARAMETER | DESCRIPTION |
|---|---|
| include_html | Include HTML content. |
| include_diagnostics | Include diagnostics. |
Source code in src/ragcrawl/export/json_exporter.py
export_document¶
export_document(
    document: Document, path: Path | None = None
) -> str | None
export_documents¶
export_documents(
    documents: list[Document], path: Path
) -> None
Export documents as JSONL file.
Source code in src/ragcrawl/export/json_exporter.py
Output publishing for ragcrawl.
SinglePagePublisher¶
Bases: MarkdownPublisher
Publishes all documents to a single markdown file.
Features:

- Auto-generated table of contents
- Per-page anchors for navigation
- Configurable page separators
Source code in src/ragcrawl/output/publisher.py
publish¶
publish(documents: list[Document]) -> list[Path]
Publish all documents to a single file.
| PARAMETER | DESCRIPTION |
|---|---|
| documents | Documents to publish. TYPE: list[Document] |

| RETURNS | DESCRIPTION |
|---|---|
| list[Path] | List containing the single output file path. |
Source code in src/ragcrawl/output/single_page.py
MultiPagePublisher¶
Bases: MarkdownPublisher
Publishes documents as individual markdown files.
Features:

- Preserves site folder structure
- Rewrites internal links to local markdown files
- Generates navigation aids (index, breadcrumbs, prev/next)
- Handles deleted pages via tombstones or redirects
Initialize multi-page publisher.
Source code in src/ragcrawl/output/multi_page.py
publish¶
publish(documents: list[Document]) -> list[Path]
Publish documents as individual files.
| PARAMETER | DESCRIPTION |
|---|---|
| documents | Documents to publish. TYPE: list[Document] |

| RETURNS | DESCRIPTION |
|---|---|
| list[Path] | List of created file paths. |