Chunking API¶
ragcrawl provides chunking utilities to prepare content for embedding models.
Overview¶
| Chunker | Description | Use Case |
|---|---|---|
| `HeadingChunker` | Splits by markdown headings | Preserve document structure |
| `TokenChunker` | Splits by token count | Fixed-size chunks for embeddings |
HeadingChunker¶
Splits content at heading boundaries while respecting token limits.
Usage¶
```python
from ragcrawl.chunking import HeadingChunker
from ragcrawl.models import Document

chunker = HeadingChunker(
    max_tokens=500,
    min_chunk_chars=100,
    heading_levels=[1, 2, 3],
)

# Chunk a single document
chunks = chunker.chunk(document)

# Chunk multiple documents
all_chunks = chunker.chunk_documents(documents)

for chunk in chunks:
    print(f"Section: {' > '.join(chunk.section_path)}")
    print(f"Tokens: ~{chunk.token_estimate}")
    print(chunk.content[:200])
    print()
```
Configuration¶
| Option | Type | Default | Description |
|---|---|---|---|
| `max_tokens` | `int` | `500` | Maximum tokens per chunk |
| `min_chunk_chars` | `int` | `100` | Minimum characters for a chunk |
| `heading_levels` | `list[int]` | `[1, 2, 3]` | Heading levels to split on |
| `include_heading` | `bool` | `True` | Include heading in chunk content |
| `preserve_code_blocks` | `bool` | `True` | Keep code blocks intact |
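For intuition, here is a minimal, self-contained sketch of heading-based splitting. It ignores token limits, minimum sizes, and code-block preservation, which the real `HeadingChunker` handles; the function name and return shape are illustrative, not ragcrawl's API:

```python
import re

def split_by_headings(markdown, levels=(1, 2, 3)):
    """Split markdown into (heading, body) pairs at the given heading levels."""
    sections = []
    heading, body = "", []
    for line in markdown.splitlines():
        m = re.match(r"^(#+)\s+(.*)$", line)
        if m and len(m.group(1)) in levels:
            # A new section starts: flush the one in progress
            if heading or body:
                sections.append((heading, "\n".join(body).strip()))
            heading, body = m.group(2), []
        else:
            body.append(line)
    if heading or body:
        sections.append((heading, "\n".join(body).strip()))
    return sections

doc = "# Intro\nHello.\n## Setup\nRun pip install.\n"
print(split_by_headings(doc))
# -> [('Intro', 'Hello.'), ('Setup', 'Run pip install.')]
```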
TokenChunker¶
Splits content into fixed-size chunks by token count.
Usage¶
```python
from ragcrawl.chunking import TokenChunker

chunker = TokenChunker(
    max_tokens=500,
    overlap_tokens=50,
    encoding_name="cl100k_base",
)

chunks = chunker.chunk(document)

for chunk in chunks:
    print(f"Chunk {chunk.chunk_index + 1}/{chunk.total_chunks}")
    print(f"Tokens: {chunk.token_estimate}")
```
Configuration¶
| Option | Type | Default | Description |
|---|---|---|---|
| `max_tokens` | `int` | `500` | Maximum tokens per chunk |
| `overlap_tokens` | `int` | `50` | Overlap between chunks |
| `encoding_name` | `str` | `"cl100k_base"` | Tokenizer encoding |
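The windowing behind fixed-size chunking can be sketched in a few lines. This operates on an already-tokenized list and does not snap to sentence boundaries the way `TokenChunker` does; it is an illustration of the overlap mechanics only:

```python
def window_tokens(tokens, max_tokens=500, overlap=50):
    """Split a token list into fixed-size windows with the given overlap."""
    step = max_tokens - overlap  # each window starts `step` tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # final window reached the end of the input
    return chunks

tokens = list(range(12))
print(window_tokens(tokens, max_tokens=5, overlap=2))
# -> [[0, 1, 2, 3, 4], [3, 4, 5, 6, 7], [6, 7, 8, 9, 10], [9, 10, 11]]
```

Note that each window repeats the last `overlap` tokens of the previous one, which helps preserve context that would otherwise be cut at a chunk boundary.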
Choosing a Chunker¶
Use HeadingChunker when:¶
- Document has clear heading structure
- Semantic boundaries are important
- Context preservation matters
- Building hierarchical indexes
Use TokenChunker when:¶
- Fixed chunk sizes needed
- Document lacks clear structure
- Maximum control over chunk boundaries
- Optimizing for specific embedding models
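The decision above can be automated with a simple heuristic. The threshold below (at least two headings) is an assumption for illustration, not part of ragcrawl:

```python
import re

def pick_chunker(markdown: str) -> str:
    """Prefer heading-based chunking when the document has enough
    heading structure to split on; otherwise fall back to token windows."""
    headings = re.findall(r"^#{1,3}\s", markdown, flags=re.MULTILINE)
    return "HeadingChunker" if len(headings) >= 2 else "TokenChunker"

print(pick_chunker("# A\ntext\n## B\ntext"))   # -> HeadingChunker
print(pick_chunker("one long unstructured paragraph"))  # -> TokenChunker
```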
Custom Chunking¶
Subclass the abstract Chunker base class to implement custom logic:
```python
from ragcrawl.chunking import Chunker
from ragcrawl.models import Document, Chunk

class CustomChunker(Chunker):
    def chunk(self, document: Document) -> list[Chunk]:
        chunks = []
        # Your chunking logic here
        return chunks

    def chunk_documents(self, documents: list[Document]) -> list[Chunk]:
        all_chunks = []
        for doc in documents:
            all_chunks.extend(self.chunk(doc))
        return all_chunks
```
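As a concrete example, here is a chunker that emits one chunk per paragraph. It uses simplified stand-in `Document`/`Chunk` dataclasses so the sketch runs without ragcrawl installed; the real models in `ragcrawl.models` carry more fields:

```python
from dataclasses import dataclass

# Stand-ins for ragcrawl.models.Document / Chunk (simplified for illustration)
@dataclass
class Document:
    doc_id: str
    content: str

@dataclass
class Chunk:
    doc_id: str
    chunk_index: int
    content: str

class ParagraphChunker:
    """Example custom chunker: one chunk per non-empty paragraph."""

    def chunk(self, document: Document) -> list[Chunk]:
        paragraphs = [p.strip() for p in document.content.split("\n\n") if p.strip()]
        return [
            Chunk(doc_id=document.doc_id, chunk_index=i, content=p)
            for i, p in enumerate(paragraphs)
        ]

doc = Document(doc_id="d1", content="First paragraph.\n\nSecond paragraph.")
print([c.content for c in ParagraphChunker().chunk(doc)])
# -> ['First paragraph.', 'Second paragraph.']
```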
Integration with Embedding APIs¶
```python
from ragcrawl.chunking import HeadingChunker
import openai

chunker = HeadingChunker(max_tokens=500)
chunks = chunker.chunk_documents(documents)

# Prepare texts with context
texts = []
for chunk in chunks:
    # Add section context as prefix
    prefix = " > ".join(chunk.section_path) if chunk.section_path else ""
    text = f"[{prefix}]\n\n{chunk.content}" if prefix else chunk.content
    texts.append(text)

# Get embeddings
response = openai.embeddings.create(
    model="text-embedding-3-small",
    input=texts,
)

# Store with metadata
for i, embedding in enumerate(response.data):
    store_vector(
        id=chunks[i].chunk_id,
        vector=embedding.embedding,
        metadata={
            "doc_id": chunks[i].doc_id,
            "section": chunks[i].section_path,
            "heading": chunks[i].heading,
        },
    )
```
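Embedding APIs cap how many inputs a single request may carry, so large chunk lists are usually sent in batches. A small helper (the batch size of 100 is an assumption; check your provider's limit):

```python
def batched(items, batch_size=100):
    """Split a list into consecutive batches of at most batch_size items,
    so each embeddings request stays under the provider's input limit."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

texts = [f"chunk {i}" for i in range(250)]
print([len(b) for b in batched(texts, batch_size=100)])
# -> [100, 100, 50]
```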
Module Reference¶
Content chunking for RAG pipelines.
HeadingChunker¶
```python
HeadingChunker(
    min_chunk_size: int = 100,
    max_chunk_size: int = 2000,
    heading_levels: list[int] | None = None,
    include_heading_in_chunk: bool = True,
    overlap_size: int = 0,
)
```
Bases: Chunker
Chunks markdown content by headings.
Creates chunks that respect document structure by splitting at heading boundaries while maintaining context.
Initialize heading chunker.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `min_chunk_size` | `int` | Minimum chunk size in characters. |
| `max_chunk_size` | `int` | Maximum chunk size in characters. |
| `heading_levels` | `list[int] \| None` | Heading levels to split on (default: `[1, 2, 3]`). |
| `include_heading_in_chunk` | `bool` | Include the heading in chunk content. |
| `overlap_size` | `int` | Characters to overlap between chunks. |
Source code in src/ragcrawl/chunking/heading_chunker.py
chunk¶
Chunk document by headings.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `document` | `Document` | Document to chunk. |

| RETURNS | DESCRIPTION |
|---|---|
| `list[Chunk]` | List of chunks. |
Source code in src/ragcrawl/chunking/heading_chunker.py
estimate_tokens¶
TokenChunker¶
```python
TokenChunker(
    chunk_size: int = 512,
    chunk_overlap: int = 50,
    encoding_name: str = "cl100k_base",
    separators: list[str] | None = None,
)
```
Bases: Chunker
Chunks content by token count.
Uses tiktoken for accurate token counting and respects natural text boundaries (sentences, paragraphs).
Initialize token chunker.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `chunk_size` | `int` | Target chunk size in tokens. |
| `chunk_overlap` | `int` | Token overlap between chunks. |
| `encoding_name` | `str` | Tiktoken encoding name. |
| `separators` | `list[str] \| None` | Text separators to try, in order. |
Source code in src/ragcrawl/chunking/token_chunker.py
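The general idea behind a `separators` list is to try coarse boundaries first (paragraphs), then progressively finer ones (lines, sentences, words) for pieces that are still too large. A self-contained sketch of that recursion, with sizes measured in characters rather than tokens for simplicity (not ragcrawl's actual implementation):

```python
def split_recursive(text, separators=("\n\n", "\n", ". ", " "), max_chars=40):
    """Split text on the coarsest separator present, recursing into any
    piece still longer than max_chars; hard-cut if no separator remains."""
    if len(text) <= max_chars:
        return [text]
    for sep in separators:
        if sep in text:
            out = []
            for piece in text.split(sep):
                out.extend(split_recursive(piece, separators, max_chars))
            return [p for p in out if p]  # drop empty fragments
    # No separator applies: fall back to fixed-size character windows
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

print(split_recursive("aaa bbb ccc", max_chars=4))
# -> ['aaa', 'bbb', 'ccc']
```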
chunk¶
Chunk document by token count.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `document` | `Document` | Document to chunk. |

| RETURNS | DESCRIPTION |
|---|---|
| `list[Chunk]` | List of chunks. |
Source code in src/ragcrawl/chunking/token_chunker.py
estimate_tokens¶
Estimate token count.
Uses tiktoken if available, otherwise approximates.
Source code in src/ragcrawl/chunking/token_chunker.py
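A "tiktoken if available, otherwise approximate" fallback might look like the following. The ~4-characters-per-token ratio is a widely used rule of thumb for English text, but it is an assumption here, not necessarily the exact ratio ragcrawl uses:

```python
def estimate_tokens(text: str) -> int:
    """Exact token count via tiktoken when installed; otherwise fall back
    to the common ~4-characters-per-token approximation."""
    try:
        import tiktoken
        return len(tiktoken.get_encoding("cl100k_base").encode(text))
    except Exception:  # tiktoken missing, or encoding data unavailable
        return max(1, len(text) // 4)

print(estimate_tokens("Chunking API documentation"))
```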
Chunker¶
Bases: ABC
Abstract base class for content chunkers.
Chunkers split documents into segments optimized for embedding and retrieval in RAG pipelines.
chunk (abstractmethod)¶
estimate_tokens (abstractmethod)¶
Estimate token count for text.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `text` | `str` | Text to estimate. |

| RETURNS | DESCRIPTION |
|---|---|
| `int` | Estimated token count. |