CrawlJob¶
The CrawlJob class is the main entry point for crawling websites.
Overview¶
CrawlJob orchestrates the entire crawling process (a rough sketch follows the list):
- Initializes the storage backend
- Creates or retrieves the site record
- Manages the URL frontier
- Coordinates fetching, extraction, and storage
- Tracks statistics and handles errors
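Conceptually, run() repeats a fetch-extract-store cycle until the frontier is empty or a limit (max_pages, max_depth) is reached. The sketch below is illustrative only: the method names on job are assumptions that mirror the responsibilities above, not ragcrawl's real internal API.

```python
# Illustrative sketch only -- these method names are assumptions, not
# ragcrawl's real internals; they mirror the responsibilities listed above.
async def crawl_loop(job):
    while job.has_pending_urls() and not job.limit_reached():
        url, depth = job.next_url()            # pop from the URL frontier
        try:
            page = await job.fetch(url)        # fetch the page
            doc = job.extract(page)            # extract content and outgoing links
            job.store(doc)                     # persist via the storage backend
            job.enqueue_links(doc, depth + 1)  # grow the frontier, respecting max_depth
        except Exception as err:
            job.record_failure(url, err)       # counted in result.stats.pages_failed
    return job.build_result()                  # CrawlResult with stats and documents
```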
Usage¶
Basic Crawl¶
```python
import asyncio

from ragcrawl.config import CrawlerConfig
from ragcrawl.core import CrawlJob

config = CrawlerConfig(
    seeds=["https://docs.example.com"],
    max_pages=100,
    max_depth=5,
)

job = CrawlJob(config)
result = asyncio.run(job.run())

# Access results
print(f"Pages crawled: {result.stats.pages_crawled}")
print(f"Pages failed: {result.stats.pages_failed}")

for doc in result.documents:
    print(f"- {doc.title}: {doc.source_url}")
```
With Callbacks¶
```python
from ragcrawl.hooks import CrawlCallbacks

class MyCallbacks(CrawlCallbacks):
    def on_page_crawled(self, page, version):
        print(f"Crawled: {page.url}")

    def on_page_error(self, url, error):
        print(f"Error: {url} - {error}")

job = CrawlJob(config, callbacks=MyCallbacks())
result = asyncio.run(job.run())
```
Graceful Stop¶
```python
import asyncio
import signal

job = CrawlJob(config)

def handle_signal(sig, frame):
    # Schedule a graceful stop on the running event loop rather than
    # letting SIGINT kill the process mid-crawl.
    asyncio.create_task(job.stop())

signal.signal(signal.SIGINT, handle_signal)
result = asyncio.run(job.run())
```
Configuration¶
See CrawlerConfig for all options.
Key options:
| Option | Type | Description |
|---|---|---|
| seeds | list[str] | Starting URLs |
| max_pages | int | Maximum pages to crawl |
| max_depth | int | Maximum link depth |
| delay_seconds | float | Delay between requests |
| include_patterns | list[str] | URL patterns to include |
| exclude_patterns | list[str] | URL patterns to exclude |
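As an illustration, a crawl confined to one documentation site might combine these options. The pattern values below are made up, and whether patterns are treated as globs or regular expressions is not specified here, so treat the syntax as an assumption.

```python
from ragcrawl.config import CrawlerConfig

# Illustrative values; pattern syntax (glob vs. regex) is an assumption.
config = CrawlerConfig(
    seeds=["https://docs.example.com"],
    max_pages=500,
    max_depth=4,
    delay_seconds=0.5,  # be polite between requests
    include_patterns=["https://docs.example.com/*"],
    exclude_patterns=["*/changelog/*"],
)
```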
API Reference¶
CrawlJob¶
Main crawl job orchestrator.
Coordinates the frontier, fetcher, extractor, and storage to perform a complete crawl.
Initialize a crawl job.
| PARAMETER | DESCRIPTION |
|---|---|
| config | Crawler configuration. TYPE: CrawlerConfig |
Source code in src/ragcrawl/core/crawl_job.py
run async¶
Execute the crawl job.
| RETURNS | DESCRIPTION |
|---|---|
| CrawlResult | CrawlResult with statistics and documents. |
Source code in src/ragcrawl/core/crawl_job.py