CLI Reference¶

ragcrawl provides a powerful command-line interface for all crawling operations.

Installation¶

The CLI is included with the ragcrawl package:

Bash

pip install ragcrawl
ragcrawl --help

Commands Overview¶

Command	Description
`crawl`	Crawl a website from seed URLs
`sync`	Incrementally sync an existing site
`sites`	List all crawled sites
`runs`	List crawl runs for a site
`list`	List pages for a site
`config`	Manage configuration

crawl¶

Start a new crawl from one or more seed URLs.

Usage¶

Bash

ragcrawl crawl [OPTIONS] SEEDS...

Arguments¶

Argument	Description
`SEEDS`	One or more seed URLs to start crawling

Options¶

Option	Type	Default	Description
`--max-pages`	int	100	Maximum pages to crawl
`--max-depth`	int	10	Maximum link depth
`--delay`	float	1.0	Delay between requests (seconds)
`--include`	str	-	URL pattern to include (can repeat)
`--exclude`	str	-	URL pattern to exclude (can repeat)
`--output`	path	./output	Output directory
`--format`	choice	multi	Output format: single, multi
`--db`	path	crawler.duckdb	Database file path
`--site-name`	str	auto	Custom name for the site
`--user-agent`	str	ragcrawl	Custom user agent
`--robots`	choice	strict	Robots mode: strict, off
`--browser`	flag	false	Enable browser rendering
`--verbose`	flag	false	Verbose output

Examples¶

Basic crawl:

Bash

ragcrawl crawl https://docs.example.com

With limits and filters:

Bash

ragcrawl crawl https://docs.example.com \
    --max-pages 500 \
    --max-depth 5 \
    --include "/docs/*" \
    --exclude "/api/internal/*"

Custom output:

Bash

ragcrawl crawl https://docs.example.com \
    --output ./knowledge-base \
    --format single

Browser rendering for JavaScript sites:

Bash

ragcrawl crawl https://spa.example.com --browser

sync¶

Incrementally update a previously crawled site.

Usage¶

Bash

ragcrawl sync [OPTIONS] SITE_ID

Arguments¶

Argument	Description
`SITE_ID`	Site ID to sync (from `ragcrawl sites`)

Options¶

Option	Type	Default	Description
`--max-pages`	int	unlimited	Maximum pages to sync
`--max-age`	float	-	Only sync pages older than N hours
`--sitemap`	flag	true	Use sitemap for discovery
`--conditional`	flag	true	Use conditional requests
`--output`	path	./output	Output directory
`--db`	path	crawler.duckdb	Database file path

Examples¶

Basic sync:

Bash

ragcrawl sync site_abc123

Sync with limits:

Bash

ragcrawl sync site_abc123 --max-pages 100 --max-age 24

sites¶

List all crawled sites in the database.

Usage¶

Bash

ragcrawl sites [OPTIONS]

Options¶

Option	Type	Default	Description
`--db`	path	crawler.duckdb	Database file path
`--json`	flag	false	Output as JSON

Example Output¶

Text Only

ID              Name                 Seeds                    Pages  Last Crawl
─────────────────────────────────────────────────────────────────────────────────
site_abc123     Example Docs         https://docs.example.com    150  2024-01-15
site_def456     API Reference        https://api.example.com      75  2024-01-14

runs¶

List crawl runs for a specific site.

Usage¶

Bash

ragcrawl runs [OPTIONS] SITE_ID

Arguments¶

Argument	Description
`SITE_ID`	Site ID to list runs for

Options¶

Option	Type	Default	Description
`--limit`	int	10	Number of runs to show
`--db`	path	crawler.duckdb	Database file path
`--json`	flag	false	Output as JSON

Example Output¶

Text Only

Run ID          Status     Started              Pages  Duration
───────────────────────────────────────────────────────────────
run_xyz789      completed  2024-01-15 10:30:00    150  5m 23s
run_xyz788      completed  2024-01-14 09:15:00    148  5m 10s
run_xyz787      failed     2024-01-13 08:00:00     45  2m 15s

list¶

List pages for a specific site.

Usage¶

Bash

ragcrawl list [OPTIONS] SITE_ID

Arguments¶

Argument	Description
`SITE_ID`	Site ID to list pages for

Options¶

Option	Type	Default	Description
`--limit`	int	100	Number of pages to show
`--status`	int	-	Filter by HTTP status
`--db`	path	crawler.duckdb	Database file path
`--json`	flag	false	Output as JSON

config¶

Manage ragcrawl configuration.

Subcommands¶

Subcommand	Description
`show`	Show current configuration
`set`	Set a configuration value
`reset`	Reset to defaults
`path`	Show config file path

Examples¶

Show config:

Bash

ragcrawl config show

Set default database:

Bash

ragcrawl config set db_path ./my-crawler.duckdb

Global Options¶

These options are available for all commands:

Option	Description
`--help`	Show help message
`--version`	Show version
`--verbose` / `-v`	Increase verbosity
`--quiet` / `-q`	Suppress output

Exit Codes¶

Code	Meaning
0	Success
1	General error
2	Configuration error
3	Network error
4	Storage error

Environment Variables¶

Variable	Description
`RAGCRAWL_DB_PATH`	Default database path
`RAGCRAWL_OUTPUT_DIR`	Default output directory
`RAGCRAWL_LOG_LEVEL`	Logging level (DEBUG, INFO, WARNING, ERROR)

CLI Reference¶

Installation¶

Commands Overview¶

crawl¶

Usage¶

Arguments¶

Options¶

Examples¶

sync¶

Usage¶

Arguments¶

Options¶

Examples¶

sites¶

Usage¶

Options¶

Example Output¶

runs¶

Usage¶

Arguments¶

Options¶

Example Output¶

list¶

Usage¶

Arguments¶

Options¶

config¶

Subcommands¶

Examples¶

Global Options¶

Exit Codes¶

Environment Variables¶

See Also¶