Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project follows Semantic Versioning.

[0.0.2] - 2025-11-27

Added

  • Enhanced documentation with community guidelines and a contribution guide.
  • Markdown extraction configuration reference, plus README coverage for MarkdownConfig.
  • CLI support for --markdown-config (TOML/JSON), plus docs showing how to invoke it.

Changed

  • Tuned the default Markdown extraction settings for documentation-style sites (pruning threshold 0.55, min words 15, text block threshold 15).
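
To make the tuned values concrete, here is a minimal sketch that models them as a frozen dataclass. Only the three numbers come from this entry; the class name `MarkdownDefaultsSketch`, the field names, and the `with_overrides` helper are assumptions and may not match the real MarkdownConfig.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class MarkdownDefaultsSketch:
    """Hypothetical stand-in for MarkdownConfig; field names are assumed."""

    pruning_threshold: float = 0.55   # value from the 0.0.2 entry
    min_words: int = 15               # value from the 0.0.2 entry
    text_block_threshold: int = 15    # value from the 0.0.2 entry


def with_overrides(base: MarkdownDefaultsSketch, **overrides) -> MarkdownDefaultsSketch:
    """Return a copy of `base` with selected fields changed, rejecting unknown keys."""
    known = {f.name for f in fields(base)}
    unknown = set(overrides) - known
    if unknown:
        raise KeyError(f"unknown setting(s): {sorted(unknown)}")
    return replace(base, **overrides)
```

A site with terse pages might then be handled with something like `with_overrides(MarkdownDefaultsSketch(), min_words=5)`, leaving the other defaults untouched.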

[0.0.1] - 2025-11-26

Added

  • Project scaffolding and initial MVP baseline for ragcrawl.
  • Initial architecture for recursive crawling, KB/RAG-ready Markdown generation, chunking, and exporters.
  • Pluggable storage backends: DuckDB (default) + optional DynamoDB via PynamoDB.
  • Incremental sync using conditional requests (ETag/Last-Modified), sitemap prioritization, and content-hash diffs.
  • Markdown publishing formats: single-page output and multi-page output with folder structure and link rewriting.
  • Centralized DuckDB storage defaulting to ~/.ragcrawl/ and CLI commands for config/listing.
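
The incremental sync item above combines HTTP conditional requests with content hashing. A minimal sketch of that general idea follows, using only the standard library. The function names and the choice of SHA-256 are assumptions for illustration; the changelog does not specify how ragcrawl implements this.

```python
import hashlib
from typing import Optional


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> dict:
    """Build request headers from validators saved during a previous crawl."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def content_hash(body: bytes) -> str:
    """Hash the response body (SHA-256 is an assumed choice)."""
    return hashlib.sha256(body).hexdigest()


def needs_reprocessing(status: int, body: bytes, stored_hash: Optional[str]) -> bool:
    """Decide whether a fetched page has to be re-extracted.

    A 304 Not Modified response means the server confirmed nothing changed.
    Otherwise, compare the body hash with the stored one to catch servers
    that resend identical content without honoring validators.
    """
    if status == 304:
        return False
    return content_hash(body) != stored_hash
```

The hash comparison acts as a fallback: many servers omit ETag or Last-Modified, so a full 200 response does not by itself prove that the page changed.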

Changed

  • N/A

Fixed

  • N/A