# Filters API
ragcrawl provides URL filtering utilities to control which pages are crawled.
## Overview
| Filter | Description |
|---|---|
| `LinkFilter` | Complete URL filtering with domains, patterns, and deduplication |
| `PatternMatcher` | Glob/regex pattern matching |
| `URLNormalizer` | URL normalization and hashing |
| `ExtensionFilter` | File extension filtering |
## LinkFilter
The main filter class combining all filtering logic.
```python
from ragcrawl.filters import LinkFilter, FilterReason

link_filter = LinkFilter(
    allowed_domains=["docs.example.com"],
    allow_subdomains=True,
    include_patterns=["/docs/*", "/api/*"],
    exclude_patterns=["/admin/*", "*secret*"],
    blocked_extensions=[".pdf", ".zip", ".png"],
)

# Check if a URL should be crawled
url = "https://docs.example.com/api/users"
result = link_filter.filter(url)
if result.allowed:
    crawl_url(url)  # crawl_url is your own fetch function
else:
    print(f"Filtered: {result.reason}")
    # Possible reasons:
    # FilterReason.DOMAIN_NOT_ALLOWED
    # FilterReason.EXCLUDED_PATTERN
    # FilterReason.NO_INCLUDE_MATCH
    # FilterReason.BLOCKED_EXTENSION
    # FilterReason.ALREADY_SEEN
```
### Deduplication
```python
# Track seen URLs
link_filter.mark_seen("https://example.com/page1")

# Check with deduplication
result = link_filter.filter("https://example.com/page1", check_seen=True)
if not result.allowed:
    print(f"Already seen: {result.reason == FilterReason.ALREADY_SEEN}")
```
### Configuration
| Option | Type | Description |
|---|---|---|
| `allowed_domains` | `list[str]` | Domains to allow (empty = all) |
| `allow_subdomains` | `bool` | Include subdomains of allowed domains |
| `allowed_schemes` | `list[str]` | URL schemes to allow (default: http, https) |
| `allowed_path_prefixes` | `list[str]` | Path prefixes to allow (empty = all) |
| `include_patterns` | `list[str]` | URL patterns to include |
| `exclude_patterns` | `list[str]` | URL patterns to exclude |
| `blocked_extensions` | `list[str]` | File extensions to skip |
| `blocked_query_params` | `list[str]` | Query parameters to strip |
## PatternMatcher
Match URLs against glob or regex patterns.
```python
from ragcrawl.filters import PatternMatcher

matcher = PatternMatcher(
    include_patterns=["/docs/*", "/api/v1/*"],
    exclude_patterns=["*internal*", "*private*"],
    case_sensitive=False,
)

# Check if a URL path should be included
if matcher.should_include("/docs/getting-started"):
    print("URL matches an include pattern")

if not matcher.should_include("/admin/settings"):
    print("URL doesn't match any include pattern")
```
### Pattern Syntax
| Pattern | Matches |
|---|---|
| `/docs/*` | `/docs/anything` |
| `**/api/*` | `any/path/api/anything` |
| `*.pdf` | Files ending in `.pdf` |
| `/api/v[12]/*` | `/api/v1/` or `/api/v2/` paths |
Patterns support both glob syntax (`*`, `**`, `?`) and regular expressions; a pattern containing regex metacharacters such as `|`, `^`, or `$` is treated as a regex. A short sketch of both styles follows.
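A minimal sketch of the two styles (whether an anchored regex uses search or full-match semantics isn't specified in this reference, so the regex results below are assumptions):

```python
from ragcrawl.filters import PatternMatcher

# Glob style: * and ** wildcards
glob_matcher = PatternMatcher(include_patterns=["/docs/*", "**/api/*"])
print(glob_matcher.should_include("/docs/intro"))     # True

# Regex style: metacharacters like ^ mark the pattern as a regex
regex_matcher = PatternMatcher(include_patterns=[r"^/api/v[12]/"])
print(regex_matcher.should_include("/api/v2/users"))  # True (assumes search semantics)
print(regex_matcher.should_include("/api/v3/users"))  # False
```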
## URLNormalizer
Normalize URLs for consistent comparison and hashing.
```python
from ragcrawl.filters import URLNormalizer

normalizer = URLNormalizer(
    remove_fragments=True,
    remove_query_params=["utm_source", "utm_medium", "utm_campaign"],
    sort_query_params=True,
)

# Normalize a URL
normalized = normalizer.normalize(
    "HTTPS://Example.COM/Page?utm_source=google&id=1#section"
)
# Result: "https://example.com/Page?id=1"

# Get the domain
domain = normalizer.get_domain("https://docs.example.com/page")
# Result: "docs.example.com"

# Get the registered domain (eTLD+1)
base = normalizer.get_registered_domain("https://docs.example.com/page")
# Result: "example.com"

# Compare exact domains
same = normalizer.is_same_domain(
    "https://docs.example.com",
    "https://api.example.com",
)
# Result: False

# Compare registered domains (subdomains count as the same site)
same_site = normalizer.is_same_registered_domain(
    "https://docs.example.com",
    "https://api.example.com",
)
# Result: True
```
### Configuration
| Option | Type | Default | Description |
|---|---|---|---|
| `remove_fragments` | `bool` | `True` | Remove URL fragments (`#...`) |
| `normalize_trailing_slash` | `bool` | `True` | Ensure consistent trailing-slash handling |
| `sort_query_params` | `bool` | `True` | Sort query parameters alphabetically |
| `remove_query_params` | `list[str]` | `None` | Query params to remove (e.g., tracking params) |
| `lowercase_hostname` | `bool` | `True` | Lowercase the hostname |
| `remove_default_ports` | `bool` | `True` | Remove default ports (80, 443) |
| `remove_www` | `bool` | `False` | Remove `www.` prefix from hostname |
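As a quick illustration of combining these options (a minimal sketch; the exact output string is an assumption based on the defaults above):

```python
from ragcrawl.filters import URLNormalizer

normalizer = URLNormalizer(remove_www=True)

# Default port and www. prefix stripped, hostname lowercased
print(normalizer.normalize("https://WWW.Example.com:443/docs"))
# Expected (assumption): "https://example.com/docs"
```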
## ExtensionFilter
Filter URLs by file extension.
```python
from ragcrawl.filters import ExtensionFilter

ext_filter = ExtensionFilter(
    blocked_extensions=[".pdf", ".png", ".jpg", ".zip", ".exe"]
)

# Check if a URL is blocked
if ext_filter.is_blocked("https://example.com/report.pdf"):
    print("PDF files are blocked")

# Get the file extension
ext = ext_filter.get_extension("https://example.com/file.tar.gz")
# Result: ".gz"
```
## Integration Example
```python
from ragcrawl.filters import LinkFilter

# Create a comprehensive filter
link_filter = LinkFilter(
    allowed_domains=["docs.python.org"],
    allow_subdomains=False,
    include_patterns=[
        "/3/*",  # Python 3 docs only
    ],
    exclude_patterns=[
        "*/whatsnew/*",  # Skip what's-new pages
        "*/_sources/*",  # Skip source files
    ],
    blocked_extensions=[
        ".pdf", ".zip", ".tar.gz",
        ".png", ".jpg", ".gif", ".svg",
    ],
)

# Use while crawling (discovered_urls and queue come from your crawler)
for url in discovered_urls:
    result = link_filter.filter(url, check_seen=True)
    if result.allowed:
        link_filter.mark_seen(url)
        queue.add(url)
```
## Module Reference
URL filtering and normalization for ragcrawl.
### LinkFilter
```python
LinkFilter(
    allowed_domains: list[str] | None = None,
    allow_subdomains: bool = True,
    allowed_schemes: list[str] | None = None,
    allowed_path_prefixes: list[str] | None = None,
    blocked_extensions: list[str] | None = None,
    include_patterns: list[str] | None = None,
    exclude_patterns: list[str] | None = None,
    blocked_query_params: list[str] | None = None,
)
```
Filters URLs based on domain, path, extension, and pattern constraints.
This is the main filter used by the crawler to determine which URLs to include in the frontier.
Initialize the link filter.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `allowed_domains` | `list[str] \| None` | Domains to allow (empty = all). |
| `allow_subdomains` | `bool` | Whether to allow subdomains of `allowed_domains`. |
| `allowed_schemes` | `list[str] \| None` | URL schemes to allow (default: http, https). |
| `allowed_path_prefixes` | `list[str] \| None` | Path prefixes to allow (empty = all). |
| `blocked_extensions` | `list[str] \| None` | File extensions to block. |
| `include_patterns` | `list[str] \| None` | Regex/glob patterns for URLs to include. |
| `exclude_patterns` | `list[str] \| None` | Regex/glob patterns for URLs to exclude. |
| `blocked_query_params` | `list[str] \| None` | Query parameters to strip. |
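The constructor options not exercised in the examples above work the same way; a minimal sketch with illustrative values (the specific schemes, prefixes, and parameter names here are assumptions, not recommendations):

```python
from ragcrawl.filters import LinkFilter

link_filter = LinkFilter(
    allowed_schemes=["https"],                # reject plain-http links
    allowed_path_prefixes=["/docs", "/api"],  # only these path prefixes
    blocked_query_params=["sessionid"],       # strip this query parameter
)
```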
#### filter
```python
filter(
    url: str,
    check_seen: bool = True,
    current_depth: int = 0,
    max_depth: int | None = None,
) -> FilterResult
```
Filter a URL and return the result.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | The URL to filter. |
| `check_seen` | `bool` | Whether to check if URL was already seen. |
| `current_depth` | `int` | Current crawl depth. |
| `max_depth` | `int \| None` | Maximum allowed depth. |
| RETURNS | DESCRIPTION |
|---|---|
| `FilterResult` | `FilterResult` with allowed status and reason. |
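Depth limits can be enforced at filter time as well (a brief sketch; the reference above doesn't name the `FilterReason` used for depth violations, so only `result.allowed` is asserted here):

```python
from ragcrawl.filters import LinkFilter

link_filter = LinkFilter(allowed_domains=["docs.example.com"])
result = link_filter.filter(
    "https://docs.example.com/deep/page",
    current_depth=5,
    max_depth=3,
)
print(result.allowed)  # False: current_depth exceeds max_depth
```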
#### mark_seen
Mark a URL as seen and return normalized form.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | The URL to mark. |
| RETURNS | DESCRIPTION |
|---|---|
| `str` | The normalized URL. |
#### is_seen
Check if URL has been seen.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | The URL to check. |
| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if URL has been seen. |
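Together, `mark_seen` and `is_seen` form the deduplication round trip (a minimal sketch; it assumes both URLs normalize to the same form under the default normalizer):

```python
from ragcrawl.filters import LinkFilter

link_filter = LinkFilter()
normalized = link_filter.mark_seen("https://example.com/page#intro")
print(normalized)  # the normalized URL, e.g. with the fragment stripped

print(link_filter.is_seen("https://example.com/page"))  # True, assuming identical normalized forms
```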
### PatternMatcher
```python
PatternMatcher(
    include_patterns: list[str] | None = None,
    exclude_patterns: list[str] | None = None,
    case_sensitive: bool = False,
)
```
Matches URLs against include/exclude patterns.
Supports both regex and glob patterns.
Initialize the pattern matcher.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `include_patterns` | `list[str] \| None` | Patterns for URLs to include (regex or glob). |
| `exclude_patterns` | `list[str] \| None` | Patterns for URLs to exclude (regex or glob). |
| `case_sensitive` | `bool` | Whether pattern matching is case-sensitive. |
#### matches_include
Check if URL matches any include pattern.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | The URL to check. |
| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if URL matches an include pattern or no include patterns are defined. |
#### matches_exclude
Check if URL matches any exclude pattern.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | The URL to check. |
| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if URL matches an exclude pattern. |
#### should_include
Determine if URL should be included based on patterns.
Exclude patterns take precedence over include patterns.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | The URL to check. |
| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if URL should be included. |
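Because exclude patterns take precedence, a URL matching both lists is rejected; for example:

```python
from ragcrawl.filters import PatternMatcher

matcher = PatternMatcher(
    include_patterns=["/docs/*"],
    exclude_patterns=["*internal*"],
)
print(matcher.should_include("/docs/internal-notes"))  # False: the exclude pattern wins
print(matcher.should_include("/docs/intro"))           # True: matches an include pattern
```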
#### get_match_reason
Get the reason for inclusion/exclusion.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | The URL to check. |
| RETURNS | DESCRIPTION |
|---|---|
| `str \| None` | A string describing the match, or None if included by default. |
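This is handy when logging filter decisions (a sketch; the exact wording of the returned string is implementation-defined):

```python
from ragcrawl.filters import PatternMatcher

matcher = PatternMatcher(exclude_patterns=["*internal*"])
reason = matcher.get_match_reason("/docs/internal-notes")
if reason is not None:
    print(f"Pattern decision: {reason}")  # e.g. names the exclude pattern that matched
```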
### URLNormalizer
```python
URLNormalizer(
    remove_fragments: bool = True,
    normalize_trailing_slash: bool = True,
    sort_query_params: bool = True,
    remove_query_params: list[str] | None = None,
    lowercase_hostname: bool = True,
    remove_default_ports: bool = True,
    remove_www: bool = False,
)
```
Normalizes URLs for deterministic deduplication.
Handles:

- Fragment removal
- Trailing slash normalization
- Query parameter sorting and filtering
- Scheme normalization
- Case normalization for hostname
- Path normalization
Initialize the URL normalizer.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `remove_fragments` | `bool` | Remove URL fragments (`#...`). |
| `normalize_trailing_slash` | `bool` | Ensure consistent trailing slash handling. |
| `sort_query_params` | `bool` | Sort query parameters alphabetically. |
| `remove_query_params` | `list[str] \| None` | List of query params to remove (e.g., tracking params). |
| `lowercase_hostname` | `bool` | Lowercase the hostname. |
| `remove_default_ports` | `bool` | Remove default ports (80, 443). |
| `remove_www` | `bool` | Remove `www.` prefix from hostname. |
#### normalize
Normalize a URL for deduplication.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | The URL to normalize. |
| RETURNS | DESCRIPTION |
|---|---|
| `str` | The normalized URL string. |
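With the defaults, normalization covers the cases listed for the class above (a brief sketch; the exact output, such as trailing-slash handling, is an assumption):

```python
from ragcrawl.filters import URLNormalizer

normalizer = URLNormalizer()
print(normalizer.normalize("https://Example.com:443/a/b?z=2&a=1#top"))
# Expected (assumption): "https://example.com/a/b?a=1&z=2"
# Default port removed, hostname lowercased, params sorted, fragment stripped.
```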
#### get_domain
Extract the domain from a URL.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | The URL. |
| RETURNS | DESCRIPTION |
|---|---|
| `str` | The domain (e.g., 'example.com'). |
#### get_registered_domain
Extract the registered domain (eTLD+1) from a URL.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | The URL. |
| RETURNS | DESCRIPTION |
|---|---|
| `str` | The registered domain (e.g., 'example.com' for 'sub.example.com'). |
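Registered-domain extraction matters most for multi-label public suffixes (a sketch that assumes the implementation consults a public-suffix list; if it simply takes the last two labels, the result below would differ):

```python
from ragcrawl.filters import URLNormalizer

normalizer = URLNormalizer()
print(normalizer.get_registered_domain("https://docs.example.co.uk/page"))
# Expected with public-suffix handling (assumption): "example.co.uk"
```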
#### is_same_domain
Check if two URLs are on the same domain.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url1` | `str` | First URL. |
| `url2` | `str` | Second URL. |
| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if same domain. |
#### is_same_registered_domain
Check if two URLs are on the same registered domain.
This considers subdomains as the same domain.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url1` | `str` | First URL. |
| `url2` | `str` | Second URL. |
| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if same registered domain. |
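The two comparisons differ exactly on subdomains:

```python
from ragcrawl.filters import URLNormalizer

normalizer = URLNormalizer()
a = "https://docs.example.com"
b = "https://api.example.com"

print(normalizer.is_same_domain(a, b))             # False: hostnames differ
print(normalizer.is_same_registered_domain(a, b))  # True: both are under example.com
```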