Chunking API

ragcrawl provides chunking utilities to prepare content for embedding models.

Overview

| Chunker | Description | Use Case |
|---|---|---|
| HeadingChunker | Splits by markdown headings | Preserve document structure |
| TokenChunker | Splits by token count | Fixed-size chunks for embeddings |

HeadingChunker

Splits content at heading boundaries while respecting configurable chunk-size limits.

Usage

Python
from ragcrawl.chunking import HeadingChunker
from ragcrawl.models import Document

chunker = HeadingChunker(
    min_chunk_size=100,
    max_chunk_size=2000,
    heading_levels=[1, 2, 3],
)

# Chunk a single document
chunks = chunker.chunk(document)

# Chunk multiple documents
all_chunks = chunker.chunk_documents(documents)

for chunk in chunks:
    print(f"Section: {' > '.join(chunk.section_path)}")
    print(f"Tokens: ~{chunk.token_estimate}")
    print(chunk.content[:200])
    print()

Configuration

| Option | Type | Default | Description |
|---|---|---|---|
| min_chunk_size | int | 100 | Minimum chunk size in characters |
| max_chunk_size | int | 2000 | Maximum chunk size in characters |
| heading_levels | list[int] \| None | None | Heading levels to split on; None means [1, 2, 3] |
| include_heading_in_chunk | bool | True | Include the heading in chunk content |
| overlap_size | int | 0 | Characters to overlap between chunks |
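
For example, to split only at top-level sections and carry some context across chunk boundaries (the values here are illustrative):

Python
chunker = HeadingChunker(
    heading_levels=[1, 2],  # split only on H1/H2 headings
    overlap_size=200,       # characters of overlap between adjacent chunks
)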

TokenChunker

Splits content into fixed-size chunks by token count.

Usage

Python
from ragcrawl.chunking import TokenChunker

chunker = TokenChunker(
    chunk_size=512,
    chunk_overlap=50,
    encoding_name="cl100k_base",
)

chunks = chunker.chunk(document)

for chunk in chunks:
    print(f"Chunk {chunk.chunk_index + 1}/{chunk.total_chunks}")
    print(f"Tokens: {chunk.token_estimate}")

Configuration

| Option | Type | Default | Description |
|---|---|---|---|
| chunk_size | int | 512 | Target chunk size in tokens |
| chunk_overlap | int | 50 | Token overlap between chunks |
| encoding_name | str | "cl100k_base" | Tiktoken encoding name |
| separators | list[str] \| None | None | Separators to try, in order; None means ["\n\n", "\n", ". ", " "] |
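
The separators option controls where splits prefer to land: earlier entries (paragraph breaks) are tried before finer ones (lines, sentences, spaces). Below is a minimal sketch of that recursive idea, measured in characters for brevity; the real chunker counts tokens, and this is not the library's _split_text:

Python
def split_recursive(text: str, max_chars: int, separators: list[str]) -> list[str]:
    """Split text so each piece fits max_chars, preferring coarse separators."""
    if len(text) <= max_chars:
        return [text]
    if not separators:
        # No separators left: hard character split as a last resort.
        return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    pieces: list[str] = []
    buf = ""
    for part in text.split(sep):
        candidate = buf + sep + part if buf else part
        if len(candidate) <= max_chars:
            buf = candidate
            continue
        if buf:
            pieces.append(buf)
            buf = ""
        if len(part) <= max_chars:
            buf = part
        else:
            # Part is still too big: retry with the finer separators.
            pieces.extend(split_recursive(part, max_chars, finer))
    if buf:
        pieces.append(buf)
    return pieces

text = "First paragraph.\n\nSecond, much longer paragraph that will not fit in one chunk. " * 2
for piece in split_recursive(text, 60, ["\n\n", "\n", ". ", " "]):
    print(repr(piece))

Note that this sketch drops the separator at each split point; a production splitter would also reattach separators and apply overlap.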

Choosing a Chunker

Use HeadingChunker when:

  • Document has clear heading structure
  • Semantic boundaries are important
  • Context preservation matters
  • Building hierarchical indexes

Use TokenChunker when:

  • Fixed chunk sizes needed
  • Document lacks clear structure
  • Maximum control over chunk boundaries
  • Optimizing for specific embedding models
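
If a corpus mixes structured and unstructured pages, you can also choose per document. A minimal sketch, assuming markdown content and a naive heading check:

Python
from ragcrawl.chunking import HeadingChunker, TokenChunker

heading_chunker = HeadingChunker()
token_chunker = TokenChunker()

def chunk_adaptive(document):
    # Prefer structure-aware chunks when the markdown contains headings;
    # otherwise fall back to fixed-size token chunks.
    content = document.markdown or ""
    has_headings = any(line.startswith("#") for line in content.splitlines())
    return (heading_chunker if has_headings else token_chunker).chunk(document)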

Custom Chunking

Subclass the Chunker abstract base class for custom logic. Both chunk and estimate_tokens are abstract, so a subclass must implement them:

Python
from ragcrawl.chunking import Chunker
from ragcrawl.models import Document, Chunk

class CustomChunker(Chunker):
    def chunk(self, document: Document) -> list[Chunk]:
        chunks = []
        # Your chunking logic here
        return chunks

    def estimate_tokens(self, text: str) -> int:
        # Required by the Chunker ABC; ~4 characters per token
        return len(text) // 4

    def chunk_documents(self, documents: list[Document]) -> list[Chunk]:
        all_chunks = []
        for doc in documents:
            all_chunks.extend(self.chunk(doc))
        return all_chunks
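
As a fuller sketch, here is a paragraph-based chunker. The Chunk fields mirror the set used in the TokenChunker source later on this page; depending on your version of the models, some fields may be optional, and the simple f-string ID stands in for the library's chunk-ID helper:

Python
from ragcrawl.chunking import Chunker
from ragcrawl.models import Document, Chunk

class ParagraphChunker(Chunker):
    """Illustrative chunker: one chunk per markdown paragraph."""

    def chunk(self, document: Document) -> list[Chunk]:
        content = document.markdown or ""
        paragraphs = [p for p in content.split("\n\n") if p.strip()]
        chunks = []
        cursor = 0
        for i, text in enumerate(paragraphs):
            start = content.index(text, cursor)  # paragraphs appear in order
            cursor = start + len(text)
            chunks.append(
                Chunk(
                    chunk_id=f"{document.doc_id}:{i}",  # stand-in for the library's ID helper
                    doc_id=document.doc_id,
                    page_id=document.page_id,
                    version_id=document.version_id,
                    content=text,
                    content_type="markdown",
                    chunk_index=i,
                    total_chunks=len(paragraphs),
                    start_offset=start,
                    end_offset=cursor,
                    char_count=len(text),
                    word_count=len(text.split()),
                    token_estimate=self.estimate_tokens(text),
                    section_path=None,
                    heading=document.title,
                    heading_level=None,
                    source_url=document.source_url,
                    title=document.title,
                    chunker_type="paragraph",
                    overlap_tokens=0,
                )
            )
        return chunks

    def estimate_tokens(self, text: str) -> int:
        # Same 4-characters-per-token heuristic the built-in chunkers fall back to.
        return len(text) // 4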

Integration with Embedding APIs

Python
from ragcrawl.chunking import HeadingChunker
from openai import OpenAI

client = OpenAI()
chunker = HeadingChunker(max_chunk_size=2000)
chunks = chunker.chunk_documents(documents)

# Prepare texts with context
texts = []
for chunk in chunks:
    # Add section context as a prefix (section_path is already a joined string)
    prefix = chunk.section_path or ""
    text = f"[{prefix}]\n\n{chunk.content}" if prefix else chunk.content
    texts.append(text)

# Get embeddings
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts,
)

# Store with metadata (store_vector is a placeholder for your vector store's upsert)
for i, embedding in enumerate(response.data):
    store_vector(
        id=chunks[i].chunk_id,
        vector=embedding.embedding,
        metadata={
            "doc_id": chunks[i].doc_id,
            "section": chunks[i].section_path,
            "heading": chunks[i].heading,
        }
    )
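
At query time, embed the query with the same model and run a similarity search. Here search_vectors is a placeholder for your vector store's query call, just as store_vector above stood in for its upsert:

Python
query = "How do I configure chunk overlap?"
query_embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input=[query],
).data[0].embedding

# search_vectors is hypothetical; substitute your vector store's search API.
for hit in search_vectors(vector=query_embedding, top_k=5):
    print(hit.metadata["section"], hit.metadata["heading"])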

Module Reference

Content chunking for RAG pipelines.

HeadingChunker

Python
HeadingChunker(
    min_chunk_size: int = 100,
    max_chunk_size: int = 2000,
    heading_levels: list[int] | None = None,
    include_heading_in_chunk: bool = True,
    overlap_size: int = 0,
)

Bases: Chunker

Chunks markdown content by headings.

Creates chunks that respect document structure by splitting at heading boundaries while maintaining context.

Initialize heading chunker.

| Parameter | Type | Default | Description |
|---|---|---|---|
| min_chunk_size | int | 100 | Minimum chunk size in characters. |
| max_chunk_size | int | 2000 | Maximum chunk size in characters. |
| heading_levels | list[int] \| None | None | Heading levels to split on (default: [1, 2, 3]). |
| include_heading_in_chunk | bool | True | Include the heading in chunk content. |
| overlap_size | int | 0 | Characters to overlap between chunks. |

Source code in src/ragcrawl/chunking/heading_chunker.py
Python
def __init__(
    self,
    min_chunk_size: int = 100,
    max_chunk_size: int = 2000,
    heading_levels: list[int] | None = None,
    include_heading_in_chunk: bool = True,
    overlap_size: int = 0,
) -> None:
    """
    Initialize heading chunker.

    Args:
        min_chunk_size: Minimum chunk size in characters.
        max_chunk_size: Maximum chunk size in characters.
        heading_levels: Heading levels to split on (default: [1, 2, 3]).
        include_heading_in_chunk: Include the heading in chunk content.
        overlap_size: Characters to overlap between chunks.
    """
    self.min_chunk_size = min_chunk_size
    self.max_chunk_size = max_chunk_size
    self.heading_levels = heading_levels or [1, 2, 3]
    self.include_heading_in_chunk = include_heading_in_chunk
    self.overlap_size = overlap_size

    # Regex for markdown headings
    self._heading_pattern = re.compile(
        r"^(#{1,6})\s+(.+?)$", re.MULTILINE
    )
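
For reference, the heading pattern captures the hash run and the heading text on each line. For example:

Python
import re

pattern = re.compile(r"^(#{1,6})\s+(.+?)$", re.MULTILINE)
print(pattern.findall("# Guide\n\nSome text.\n\n## Installation"))
# [('#', 'Guide'), ('##', 'Installation')]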

chunk

Python
chunk(document: Document) -> list[Chunk]

Chunk document by headings.

Parameters:

  • document (Document): Document to chunk.

Returns:

  • list[Chunk]: List of chunks.

Source code in src/ragcrawl/chunking/heading_chunker.py
Python
def chunk(self, document: Document) -> list[Chunk]:
    """
    Chunk document by headings.

    Args:
        document: Document to chunk.

    Returns:
        List of chunks.
    """
    content = document.markdown
    if not content:
        return []

    # Parse sections
    sections = self._parse_sections(content)

    if not sections:
        # No headings found, treat as single chunk
        return self._create_single_chunk(document, content)

    # Create chunks from sections
    chunks = []
    section_path_stack: list[str] = []

    for i, section in enumerate(sections):
        # Update section path
        while section_path_stack and len(section_path_stack) >= section.level:
            section_path_stack.pop()
        section_path_stack.append(section.heading)
        section_path = " > ".join(section_path_stack)

        # Build chunk content
        if self.include_heading_in_chunk:
            chunk_content = f"{'#' * section.level} {section.heading}\n\n{section.content}"
        else:
            chunk_content = section.content

        # Handle oversized sections
        if len(chunk_content) > self.max_chunk_size:
            sub_chunks = self._split_large_section(
                document,
                chunk_content,
                section,
                section_path,
                len(chunks),
            )
            chunks.extend(sub_chunks)
        elif len(chunk_content) >= self.min_chunk_size:
            chunk = self._create_chunk(
                document=document,
                content=chunk_content,
                index=len(chunks),
                start_offset=section.start_offset,
                end_offset=section.end_offset,
                section_path=section_path,
                heading=section.heading,
                heading_level=section.level,
            )
            chunks.append(chunk)
        # else: skip chunks that are too small

    # Update total_chunks
    for chunk in chunks:
        chunk.total_chunks = len(chunks)

    return chunks
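
Note that section_path is assembled as a single string joined with " > " (for example "Guide > Installation"), so it can be printed or embedded directly without further joining.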

estimate_tokens

Python
estimate_tokens(text: str) -> int

Estimate token count (roughly 4 chars per token).

Source code in src/ragcrawl/chunking/heading_chunker.py
Python
def estimate_tokens(self, text: str) -> int:
    """Estimate token count (roughly 4 chars per token)."""
    return len(text) // 4
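
At that ratio, a 2,000-character section estimates to 500 tokens. For exact counts, TokenChunker below uses tiktoken when it is installed.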

TokenChunker

Python
TokenChunker(
    chunk_size: int = 512,
    chunk_overlap: int = 50,
    encoding_name: str = "cl100k_base",
    separators: list[str] | None = None,
)

Bases: Chunker

Chunks content by token count.

Uses tiktoken for accurate token counting and respects natural text boundaries (sentences, paragraphs).

Initialize token chunker.

| Parameter | Type | Default | Description |
|---|---|---|---|
| chunk_size | int | 512 | Target chunk size in tokens. |
| chunk_overlap | int | 50 | Token overlap between chunks. |
| encoding_name | str | "cl100k_base" | Tiktoken encoding name. |
| separators | list[str] \| None | None | Text separators to try, in order. |

Source code in src/ragcrawl/chunking/token_chunker.py
Python
def __init__(
    self,
    chunk_size: int = 512,
    chunk_overlap: int = 50,
    encoding_name: str = "cl100k_base",
    separators: list[str] | None = None,
) -> None:
    """
    Initialize token chunker.

    Args:
        chunk_size: Target chunk size in tokens.
        chunk_overlap: Token overlap between chunks.
        encoding_name: Tiktoken encoding name.
        separators: Text separators to try, in order.
    """
    self.chunk_size = chunk_size
    self.chunk_overlap = chunk_overlap
    self.encoding_name = encoding_name
    self.separators = separators or ["\n\n", "\n", ". ", " "]

    self._encoding = None

encoding property

Python
encoding

Get or create tiktoken encoding.
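
The property body is not reproduced here. A typical lazy-initialization pattern for an optional dependency looks like the following; this is a sketch consistent with the estimate_tokens fallback below, not the library's verbatim source:

Python
@property
def encoding(self):
    """Get or create the tiktoken encoding; None when tiktoken is unavailable."""
    if self._encoding is None:
        try:
            import tiktoken
            self._encoding = tiktoken.get_encoding(self.encoding_name)
        except ImportError:
            return None
    return self._encoding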

chunk

Python
chunk(document: Document) -> list[Chunk]

Chunk document by token count.

Parameters:

  • document (Document): Document to chunk.

Returns:

  • list[Chunk]: List of chunks.

Source code in src/ragcrawl/chunking/token_chunker.py
Python
def chunk(self, document: Document) -> list[Chunk]:
    """
    Chunk document by token count.

    Args:
        document: Document to chunk.

    Returns:
        List of chunks.
    """
    content = document.markdown
    if not content:
        return []

    # Split content into chunks
    text_chunks = self._split_text(content)

    # Create Chunk objects
    chunks = []
    current_offset = 0

    for i, text in enumerate(text_chunks):
        # Find actual offset in original content
        start_offset = content.find(text[:50], current_offset)
        if start_offset == -1:
            start_offset = current_offset
        end_offset = start_offset + len(text)
        current_offset = end_offset - self.chunk_overlap * 4  # Approximate

        chunk = Chunk(
            chunk_id=generate_chunk_id(document.doc_id, i),
            doc_id=document.doc_id,
            page_id=document.page_id,
            version_id=document.version_id,
            content=text,
            content_type="markdown",
            chunk_index=i,
            total_chunks=len(text_chunks),
            start_offset=start_offset,
            end_offset=end_offset,
            char_count=len(text),
            word_count=len(text.split()),
            token_estimate=self.estimate_tokens(text),
            section_path=None,
            heading=document.title,
            heading_level=None,
            source_url=document.source_url,
            title=document.title,
            chunker_type="token",
            overlap_tokens=self.chunk_overlap if i > 0 else 0,
        )
        chunks.append(chunk)

    return chunks

estimate_tokens

Python
estimate_tokens(text: str) -> int

Estimate token count.

Uses tiktoken if available, otherwise approximates.

Source code in src/ragcrawl/chunking/token_chunker.py
Python
def estimate_tokens(self, text: str) -> int:
    """
    Estimate token count.

    Uses tiktoken if available, otherwise approximates.
    """
    if self.encoding:
        return len(self.encoding.encode(text))

    # Fallback: approximately 4 characters per token
    return len(text) // 4

Chunker

Bases: ABC

Abstract base class for content chunkers.

Chunkers split documents into segments optimized for embedding and retrieval in RAG pipelines.

chunk abstractmethod

Python
chunk(document: Document) -> list[Chunk]

Chunk a document into segments.

Parameters:

  • document (Document): Document to chunk.

Returns:

  • list[Chunk]: List of chunks.

Source code in src/ragcrawl/chunking/chunker.py
Python
@abstractmethod
def chunk(self, document: Document) -> list[Chunk]:
    """
    Chunk a document into segments.

    Args:
        document: Document to chunk.

    Returns:
        List of chunks.
    """
    ...

estimate_tokens abstractmethod

Python
estimate_tokens(text: str) -> int

Estimate token count for text.

Parameters:

  • text (str): Text to estimate.

Returns:

  • int: Estimated token count.

Source code in src/ragcrawl/chunking/chunker.py
Python
@abstractmethod
def estimate_tokens(self, text: str) -> int:
    """
    Estimate token count for text.

    Args:
        text: Text to estimate.

    Returns:
        Estimated token count.
    """
    ...