Skip to content

Installation

Requirements

  • Python 3.10 or higher
  • pip or uv package manager

Installation Options

Basic Installation

The basic installation includes DuckDB storage and HTTP-only fetching:

Bash
pip install ragcrawl

Or with uv:

Bash
uv pip install ragcrawl

Browser Rendering Support

For JavaScript-heavy sites, install with browser support:

Bash
pip install ragcrawl[browser]

This installs Playwright for headless browser rendering.

After installation, set up Playwright:

Bash
playwright install chromium

DynamoDB Support

For cloud deployments with AWS DynamoDB:

Bash
pip install ragcrawl[dynamodb]

Full Installation

Install all optional dependencies:

Bash
pip install ragcrawl[all]

Development Installation

For contributing to the project:

Bash
git clone https://github.com/your-org/ragcrawl.git
cd ragcrawl
pip install -e ".[dev]"

Verifying Installation

Test your installation:

Bash
# Check CLI is available
ragcrawl --version

# Run a simple crawl
ragcrawl crawl https://example.com --max-pages 5 --output ./test-output

Dependencies

Core Dependencies

Package Purpose
crawl4ai Web fetching and HTML-to-Markdown conversion
duckdb Default local storage backend
httpx Async HTTP client
pydantic Data validation and configuration
structlog Structured logging
xxhash Fast content hashing
tiktoken Token counting for chunking
click Command-line interface

Optional Dependencies

Package Purpose Extra
playwright Browser rendering [browser]
pynamodb DynamoDB ORM [dynamodb]

Troubleshooting

Playwright Installation Issues

If you encounter issues with Playwright:

Bash
# Install system dependencies (Ubuntu/Debian)
sudo apt-get install libnss3 libatk1.0-0 libatk-bridge2.0-0 libcups2 libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 libgbm1 libasound2

# Install browsers
playwright install chromium

DuckDB Permission Issues

Ensure the storage directory is writable:

Bash
# Create with proper permissions
mkdir -p ./data
chmod 755 ./data
ragcrawl crawl https://example.com --storage ./data/crawler.duckdb

AWS Credentials for DynamoDB

Set up AWS credentials for DynamoDB:

Bash
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1

Or use AWS profiles:

Bash
aws configure --profile crawler
export AWS_PROFILE=crawler