Output Configuration

Configure how crawled content is published.

Overview

ragcrawl can output content in two modes:

  • Multi-page: Each page becomes a separate Markdown file
  • Single-page: All content combined into one file

OutputConfig

Python
from ragcrawl.config.output_config import OutputConfig, OutputMode

config = OutputConfig(
    mode=OutputMode.MULTI,
    root_dir="./output",
)

Options

| Option | Type | Default | Description |
|---|---|---|---|
| mode | OutputMode | MULTI | SINGLE or MULTI |
| root_dir | str | "./output" | Output directory |
| include_metadata | bool | True | Include source URLs in output |
| include_toc | bool | True | Include table of contents (SINGLE mode) |
| rewrite_links | bool | True | Rewrite internal links (MULTI mode) |
| single_file_name | str | "knowledge_base.md" | Output filename (SINGLE mode) |
| generate_index | bool | True | Generate index.md (MULTI mode) |

Multi-Page Output

Each crawled page becomes a separate Markdown file:

Python
config = OutputConfig(
    mode=OutputMode.MULTI,
    root_dir="./docs-output",
    include_metadata=True,
    rewrite_links=True,
    generate_index=True,
)

Output Structure

Text Only
docs-output/
├── index.md                 # Generated index
└── example.com/
    ├── index.md
    ├── docs/
    │   ├── getting-started.md
    │   ├── installation.md
    │   └── api/
    │       ├── overview.md
    │       └── reference.md
    └── blog/
        ├── post-1.md
        └── post-2.md
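
The layout above mirrors each page's URL: the host becomes a directory and the URL path becomes the file path. The following is a minimal sketch of that mapping, not ragcrawl's actual implementation; `url_to_output_path` is a hypothetical helper, and the handling of trailing slashes is an assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_output_path(url: str, root_dir: str = "./docs-output") -> PurePosixPath:
    """Hypothetical helper: map a page URL to a Markdown path under root_dir."""
    parsed = urlparse(url)
    # Drop empty segments so leading/trailing slashes do not create blank parts.
    segments = [s for s in parsed.path.split("/") if s]
    if not segments or parsed.path.endswith("/"):
        # Assumption: directory-style URLs (including the site root) get index.md.
        segments.append("index")
    segments[-1] += ".md"
    return PurePosixPath(root_dir, parsed.netloc, *segments)


print(url_to_output_path("https://example.com/docs/api/overview"))
```

Appending ".md" to the last segment, rather than replacing a suffix, keeps names such as "v1.2" intact.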

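When rewrite_links is enabled, internal links are converted to relative links between the generated Markdown files. For example, a link on the page docs/getting-started: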

Original HTML:

HTML
<a href="/docs/api/overview">API Overview</a>

Output Markdown:

Markdown
[API Overview](./api/overview.md)
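
The conversion amounts to computing the target file's path relative to the directory of the page that contains the link. This is a rough sketch of the idea, assuming both values are site-absolute URL paths; `rewrite_internal_link` is a hypothetical name, and ragcrawl's own rewriting may differ.

```python
import posixpath


def rewrite_internal_link(href: str, current_page: str) -> str:
    """Hypothetical helper: turn a site-absolute href into a relative .md link.

    Both arguments are site-absolute URL paths, e.g. "/docs/getting-started".
    """
    target = href.strip("/") + ".md"
    current_dir = posixpath.dirname(current_page.strip("/"))
    relative = posixpath.relpath(target, start=current_dir or ".")
    # Prefix "./" unless the link climbs out of the current directory.
    return relative if relative.startswith("..") else "./" + relative


print(rewrite_internal_link("/docs/api/overview", "/docs/getting-started"))
```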

Metadata Headers

When include_metadata is enabled, each file begins with HTML comments that record where the content came from:

Markdown
<!-- Source: https://example.com/docs/getting-started -->
<!-- Crawled: 2024-01-15T10:30:00Z -->

# Getting Started

Content here...
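
These comments can be read back later, for example to attach a source URL to a retrieved chunk. This is a minimal sketch, assuming the header is a run of `<!-- Key: value -->` lines at the top of the file as in the example above; `read_metadata` is a hypothetical helper, not part of ragcrawl.

```python
import re

# Matches comment lines such as "<!-- Source: https://example.com/page -->".
HEADER_RE = re.compile(r"^<!--\s*(\w+):\s*(.*?)\s*-->$")


def read_metadata(markdown: str) -> dict[str, str]:
    """Collect leading "<!-- Key: value -->" comments into a dict."""
    metadata = {}
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines inside the header
        match = HEADER_RE.match(stripped)
        if match is None:
            break  # the first content line ends the header
        metadata[match.group(1).lower()] = match.group(2)
    return metadata


sample = (
    "<!-- Source: https://example.com/docs/getting-started -->\n"
    "<!-- Crawled: 2024-01-15T10:30:00Z -->\n"
    "\n"
    "# Getting Started\n"
)
print(read_metadata(sample))
```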

Single-Page Output

All content is combined into one file:

Python
config = OutputConfig(
    mode=OutputMode.SINGLE,
    root_dir="./output",
    single_file_name="knowledge_base.md",
    include_toc=True,
    include_metadata=True,
)

Output Structure

Text Only
output/
└── knowledge_base.md

File Format

Markdown
# Knowledge Base

Generated from https://example.com

## Table of Contents

- [Getting Started](#getting-started)
- [Installation](#installation)
- [Configuration](#configuration)

---

## Getting Started

<!-- Source: https://example.com/docs/getting-started -->

Content here...

---

## Installation

<!-- Source: https://example.com/docs/installation -->

Content here...
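
Because each page sits under its own "## " heading with a Source comment, the combined file can be split back into per-page sections, for example before chunking. This sketch assumes the format shown above and that page bodies contain no level-2 headings of their own; `split_sections` is a hypothetical helper.

```python
import re

SECTION_RE = re.compile(r"^## (.+)$", re.MULTILINE)
SOURCE_RE = re.compile(r"<!--\s*Source:\s*(\S+)\s*-->")


def split_sections(markdown: str) -> list[dict]:
    """Split a combined file into per-page sections keyed by "## " headings."""
    sections = []
    matches = list(SECTION_RE.finditer(markdown))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        body = markdown[match.end():end]
        source = SOURCE_RE.search(body)
        if source is None:
            continue  # e.g. the table of contents has no Source comment
        text = SOURCE_RE.sub("", body, count=1).strip()
        # Drop the trailing "---" separator that precedes the next section.
        text = text.removesuffix("---").rstrip()
        sections.append(
            {"title": match.group(1).strip(), "source": source.group(1), "text": text}
        )
    return sections
```

Calling `split_sections(Path("output/knowledge_base.md").read_text(encoding="utf-8"))` would return one dict per crawled page, each with its title, source URL, and text.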

Custom Publishers

To create a custom output format, subclass MarkdownPublisher and override publish:

Python
from pathlib import Path

from ragcrawl.output.publisher import MarkdownPublisher


class MyCustomPublisher(MarkdownPublisher):
    """Writes one Markdown file per document, named by document ID."""

    def publish(self, documents: list) -> list[Path]:
        output_files = []

        for doc in documents:
            # Render the document with YAML front matter
            content = self.format_document(doc)

            # Name the file after the document ID instead of the URL path
            filename = f"{doc.doc_id}.md"
            path = Path(self.config.root_dir) / filename

            path.parent.mkdir(parents=True, exist_ok=True)
            # Write UTF-8 explicitly so output does not depend on the platform default
            path.write_text(content, encoding="utf-8")
            output_files.append(path)

        return output_files

    def format_document(self, doc):
        return f"""---
title: {doc.title}
url: {doc.url}
date: {doc.fetched_at.isoformat()}
---

{doc.content}
"""

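One caveat with the front matter in format_document: values are interpolated unquoted, so a title containing a colon or a quote could produce invalid YAML. A possible variant, sketched here under the assumption that documents expose the same title, url, fetched_at, and content attributes used above, quotes each value with json.dumps (JSON strings are valid YAML double-quoted scalars). `format_front_matter` and the stand-in document are illustrative only.

```python
import json
from datetime import datetime, timezone
from types import SimpleNamespace


def format_front_matter(doc) -> str:
    """Render YAML front matter with JSON-quoted string values."""
    fields = {
        "title": doc.title,
        "url": doc.url,
        "date": doc.fetched_at.isoformat(),
    }
    lines = [
        f"{key}: {json.dumps(value, ensure_ascii=False)}"
        for key, value in fields.items()
    ]
    return "---\n" + "\n".join(lines) + "\n---\n\n" + doc.content + "\n"


# Stand-in for a crawled document (hypothetical; real objects come from ragcrawl).
doc = SimpleNamespace(
    title='Guide: "Setup"',
    url="https://example.com/docs/setup",
    fetched_at=datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc),
    content="Body",
)
print(format_front_matter(doc))
```
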
CLI Options

Bash
# Multi-page output (default)
ragcrawl crawl https://example.com --output ./output --output-mode multi

# Single-page output
ragcrawl crawl https://example.com --output ./output --output-mode single

Best Practices

  1. Multi-page for large sites: Easier to navigate and search
  2. Single-page for LLM context: One file for full-text RAG
  3. Enable link rewriting: Keeps navigation working offline
  4. Include metadata: Helps trace content back to sources
  5. Use consistent structure: Match the site's URL hierarchy