Filters API

ragcrawl provides URL filtering utilities to control which pages are crawled.

Overview

Filter           Description
LinkFilter       Complete URL filtering with domains, patterns, and deduplication
PatternMatcher   Glob/regex pattern matching
URLNormalizer    URL normalization and hashing
ExtensionFilter  File extension filtering

LinkFilter

The main filter class combining all filtering logic.

Python
from ragcrawl.filters import LinkFilter, FilterReason

link_filter = LinkFilter(
    allowed_domains=["docs.example.com"],
    allow_subdomains=True,
    include_patterns=["/docs/*", "/api/*"],
    exclude_patterns=["/admin/*", "*secret*"],
    blocked_extensions=[".pdf", ".zip", ".png"],
)

# Check whether a URL should be crawled
url = "https://docs.example.com/api/users"
result = link_filter.filter(url)

if result.allowed:
    crawl_url(url)
else:
    print(f"Filtered: {result.reason}")
    # FilterReason.DOMAIN_NOT_ALLOWED
    # FilterReason.EXCLUDED_PATTERN
    # FilterReason.NO_INCLUDE_MATCH
    # FilterReason.BLOCKED_EXTENSION
    # FilterReason.ALREADY_SEEN
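
The comments above list common rejection reasons; the full set in the source below also includes INVALID_URL, INVALID_SCHEME, PATH_NOT_ALLOWED, and MAX_DEPTH_EXCEEDED, plus ALLOWED for accepted URLs. For orientation, here is a minimal sketch of the result object's shape, inferred from the source listings in the Module Reference (the real class ships with ragcrawl):

Python
# Sketch only: field shapes inferred from the filter() source in the
# Module Reference below; the real class ships with ragcrawl.
from dataclasses import dataclass

from ragcrawl.filters import FilterReason

@dataclass
class FilterResult:
    allowed: bool
    reason: FilterReason
    normalized_url: str | None = None
    details: str | None = None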

Deduplication

Python
# Track seen URLs
link_filter.mark_seen("https://example.com/page1")

# Check with deduplication
result = link_filter.filter("https://example.com/page1", check_seen=True)
if not result.allowed:
    print(f"Already seen: {result.reason == FilterReason.ALREADY_SEEN}")

Configuration

Option                 Type       Description
allowed_domains        list[str]  Domains to allow (empty = all)
allow_subdomains       bool       Include subdomains of allowed domains
allowed_schemes        list[str]  URL schemes to allow (default: http, https)
allowed_path_prefixes  list[str]  Path prefixes to allow (empty = all)
include_patterns       list[str]  URL patterns to include
exclude_patterns       list[str]  URL patterns to exclude
blocked_extensions     list[str]  File extensions to skip
blocked_query_params   list[str]  Query parameters to strip

PatternMatcher

Match URLs against glob or regex patterns.

Python
from ragcrawl.filters import PatternMatcher

matcher = PatternMatcher(
    include_patterns=["/docs/*", "/api/v1/*"],
    exclude_patterns=["*internal*", "*private*"],
    case_sensitive=False,
)

# Check if URL path should be included
if matcher.should_include("/docs/getting-started"):
    print("URL matches include pattern")

if not matcher.should_include("/admin/settings"):
    print("URL doesn't match any include pattern")

Pattern Syntax

Pattern       Matches
/docs/*       /docs/anything
**/api/*      any/path/api/anything
*.pdf         Files ending in .pdf
/api/v[12]/*  /api/v1/ or /api/v2/

Patterns support both glob syntax (*, **, ?) and regex (when containing |, ^, $, etc.).
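
For instance, a pattern containing | or ^ is treated as a regex rather than a glob. A short sketch, assuming the detection heuristic described above:

Python
from ragcrawl.filters import PatternMatcher

# "^/docs/(api|guide)/" contains regex metacharacters, so it is
# compiled as a regex rather than translated from glob syntax.
matcher = PatternMatcher(include_patterns=[r"^/docs/(api|guide)/"])
assert matcher.should_include("/docs/api/users")
assert not matcher.should_include("/docs/internal/notes")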

URLNormalizer

Normalize URLs for consistent comparison and hashing.

Python
from ragcrawl.filters import URLNormalizer

normalizer = URLNormalizer(
    remove_fragments=True,
    sort_query_params=True,
)

# Normalize URL (tracking params such as utm_* are stripped by default)
normalized = normalizer.normalize(
    "HTTPS://Example.COM/Page?utm_source=google&id=1#section"
)
# Result: "https://example.com/Page?id=1"

# Get domain
domain = normalizer.get_domain("https://docs.example.com/page")
# Result: "docs.example.com"

# Get registered domain
base = normalizer.get_registered_domain("https://docs.example.com/page")
# Result: "example.com"

# Check same registered domain (subdomains count as the same site)
same = normalizer.is_same_registered_domain(
    "https://docs.example.com",
    "https://api.example.com",
)
# Result: True

Configuration

Option                    Type              Default  Description
remove_fragments          bool              True     Remove URL fragments (#...)
normalize_trailing_slash  bool              True     Strip trailing slashes (except root)
sort_query_params         bool              True     Sort query parameters alphabetically
remove_query_params       list[str] | None  None     Extra query params to strip
lowercase_hostname        bool              True     Lowercase the hostname
remove_default_ports      bool              True     Remove default ports (80, 443)
remove_www                bool              False    Remove www. prefix from hostname

Tracking parameters (utm_*, fbclid, gclid, ref, source) are removed by default, in addition to any names passed via remove_query_params.
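
The overview also mentions hashing. No dedicated hashing API is documented on this page, so one option is to hash the normalized URL yourself; a sketch using the standard library (hashlib here is not a ragcrawl API):

Python
import hashlib

from ragcrawl.filters import URLNormalizer

normalizer = URLNormalizer()
canonical = normalizer.normalize("https://Example.com/a?utm_source=x")
# Stable dedup key derived from the canonical form
key = hashlib.sha256(canonical.encode("utf-8")).hexdigest()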

ExtensionFilter

Filter URLs by file extension.

Python
from ragcrawl.filters import ExtensionFilter

ext_filter = ExtensionFilter(
    blocked_extensions=[".pdf", ".png", ".jpg", ".zip", ".exe"]
)

# Check if URL is blocked
if ext_filter.is_blocked("https://example.com/report.pdf"):
    print("PDF files are blocked")

# Get extension
ext = ext_filter.get_extension("https://example.com/file.tar.gz")
# Result: ".gz"

Integration Example

Python
from ragcrawl.filters import LinkFilter

# Create comprehensive filter
link_filter = LinkFilter(
    allowed_domains=["docs.python.org"],
    allow_subdomains=False,
    include_patterns=[
        "/3/*",          # Python 3 docs only
    ],
    exclude_patterns=[
        "*/whatsnew/*",  # Skip what's new pages
        "*/_sources/*",  # Skip source files
    ],
    blocked_extensions=[
        ".pdf", ".zip", ".tar.gz",
        ".png", ".jpg", ".gif", ".svg",
    ],
)

# Use in crawling (discovered_urls and queue come from your crawler loop)
for url in discovered_urls:
    result = link_filter.filter(url, check_seen=True)
    if result.allowed:
        link_filter.mark_seen(url)
        queue.add(url)
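
When tuning filters, it can help to tally why URLs are rejected; a sketch reusing discovered_urls from above:

Python
from collections import Counter

reasons = Counter()
for url in discovered_urls:
    result = link_filter.filter(url, check_seen=True)
    if not result.allowed:
        # e.g. FilterReason.EXCLUDED_PATTERN, FilterReason.BLOCKED_EXTENSION
        reasons[result.reason] += 1
print(reasons.most_common())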

Module Reference

URL filtering and normalization for ragcrawl.

LinkFilter

Python
LinkFilter(
    allowed_domains: list[str] | None = None,
    allow_subdomains: bool = True,
    allowed_schemes: list[str] | None = None,
    allowed_path_prefixes: list[str] | None = None,
    blocked_extensions: list[str] | None = None,
    include_patterns: list[str] | None = None,
    exclude_patterns: list[str] | None = None,
    blocked_query_params: list[str] | None = None,
)

Filters URLs based on domain, path, extension, and pattern constraints.

This is the main filter used by the crawler to determine which URLs to include in the frontier.

Initialize the link filter.

PARAMETER              TYPE              DEFAULT  DESCRIPTION
allowed_domains        list[str] | None  None     Domains to allow (empty = all).
allow_subdomains       bool              True     Whether to allow subdomains of allowed_domains.
allowed_schemes        list[str] | None  None     URL schemes to allow (default: http, https).
allowed_path_prefixes  list[str] | None  None     Path prefixes to allow (empty = all).
blocked_extensions     list[str] | None  None     File extensions to block.
include_patterns       list[str] | None  None     Regex/glob patterns for URLs to include.
exclude_patterns       list[str] | None  None     Regex/glob patterns for URLs to exclude.
blocked_query_params   list[str] | None  None     Query parameters to strip.

Source code in src/ragcrawl/filters/link_filter.py
Python
def __init__(
    self,
    allowed_domains: list[str] | None = None,
    allow_subdomains: bool = True,
    allowed_schemes: list[str] | None = None,
    allowed_path_prefixes: list[str] | None = None,
    blocked_extensions: list[str] | None = None,
    include_patterns: list[str] | None = None,
    exclude_patterns: list[str] | None = None,
    blocked_query_params: list[str] | None = None,
) -> None:
    """
    Initialize the link filter.

    Args:
        allowed_domains: Domains to allow (empty = all).
        allow_subdomains: Whether to allow subdomains of allowed_domains.
        allowed_schemes: URL schemes to allow (default: http, https).
        allowed_path_prefixes: Path prefixes to allow (empty = all).
        blocked_extensions: File extensions to block.
        include_patterns: Regex/glob patterns for URLs to include.
        exclude_patterns: Regex/glob patterns for URLs to exclude.
        blocked_query_params: Query parameters to strip.
    """
    self.allowed_domains = set(d.lower() for d in (allowed_domains or []))
    self.allow_subdomains = allow_subdomains
    self.allowed_schemes = set(s.lower() for s in (allowed_schemes or ["http", "https"]))
    self.allowed_path_prefixes = list(allowed_path_prefixes or [])

    self.normalizer = URLNormalizer(remove_query_params=blocked_query_params)
    self.pattern_matcher = PatternMatcher(include_patterns, exclude_patterns)
    self.extension_filter = ExtensionFilter(blocked_extensions)

    # Track seen URLs for deduplication
    self._seen_urls: set[str] = set()

seen_count property

Python
seen_count: int

Get the count of seen URLs.
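
A quick check, assuming a freshly constructed filter:

Python
lf = LinkFilter()
lf.mark_seen("https://example.com/a")
print(lf.seen_count)  # 1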

filter

Python
filter(
    url: str,
    check_seen: bool = True,
    current_depth: int = 0,
    max_depth: int | None = None,
) -> FilterResult

Filter a URL and return the result.

PARAMETER      TYPE        DEFAULT   DESCRIPTION
url            str         required  The URL to filter.
check_seen     bool        True      Whether to check if URL was already seen.
current_depth  int         0         Current crawl depth.
max_depth      int | None  None      Maximum allowed depth.

RETURNS (FilterResult): FilterResult with allowed status and reason.

Source code in src/ragcrawl/filters/link_filter.py
Python
def filter(
    self,
    url: str,
    check_seen: bool = True,
    current_depth: int = 0,
    max_depth: int | None = None,
) -> FilterResult:
    """
    Filter a URL and return the result.

    Args:
        url: The URL to filter.
        check_seen: Whether to check if URL was already seen.
        current_depth: Current crawl depth.
        max_depth: Maximum allowed depth.

    Returns:
        FilterResult with allowed status and reason.
    """
    # Parse and validate URL
    try:
        parsed = urlparse(url)
    except Exception:
        return FilterResult(
            allowed=False,
            reason=FilterReason.INVALID_URL,
            details="Failed to parse URL",
        )

    if not parsed.scheme or not parsed.netloc:
        return FilterResult(
            allowed=False,
            reason=FilterReason.INVALID_URL,
            details="Missing scheme or netloc",
        )

    # Scheme check
    if parsed.scheme.lower() not in self.allowed_schemes:
        return FilterResult(
            allowed=False,
            reason=FilterReason.INVALID_SCHEME,
            details=f"Scheme '{parsed.scheme}' not allowed",
        )

    # Normalize URL
    normalized = self.normalizer.normalize(url)

    # Deduplication check
    if check_seen and normalized in self._seen_urls:
        return FilterResult(
            allowed=False,
            reason=FilterReason.ALREADY_SEEN,
            normalized_url=normalized,
        )

    # Depth check
    if max_depth is not None and current_depth > max_depth:
        return FilterResult(
            allowed=False,
            reason=FilterReason.MAX_DEPTH_EXCEEDED,
            normalized_url=normalized,
            details=f"Depth {current_depth} exceeds max {max_depth}",
        )

    # Domain check
    if self.allowed_domains:
        hostname = parsed.netloc.lower()
        # Remove port if present
        if ":" in hostname:
            hostname = hostname.split(":")[0]

        if not self._is_domain_allowed(hostname):
            return FilterResult(
                allowed=False,
                reason=FilterReason.DOMAIN_NOT_ALLOWED,
                normalized_url=normalized,
                details=f"Domain '{hostname}' not in allowed list",
            )

    # Path prefix check
    if self.allowed_path_prefixes:
        path = parsed.path
        if not any(path.startswith(prefix) for prefix in self.allowed_path_prefixes):
            return FilterResult(
                allowed=False,
                reason=FilterReason.PATH_NOT_ALLOWED,
                normalized_url=normalized,
                details=f"Path '{path}' doesn't match allowed prefixes",
            )

    # Extension check
    if self.extension_filter.is_blocked(url):
        ext = self.extension_filter.get_extension(url)
        return FilterResult(
            allowed=False,
            reason=FilterReason.BLOCKED_EXTENSION,
            normalized_url=normalized,
            details=f"Extension '{ext}' is blocked",
        )

    # Pattern check
    if not self.pattern_matcher.should_include(url):
        reason = self.pattern_matcher.get_match_reason(url)
        if self.pattern_matcher.matches_exclude(url):
            return FilterResult(
                allowed=False,
                reason=FilterReason.EXCLUDED_PATTERN,
                normalized_url=normalized,
                details=reason,
            )
        else:
            return FilterResult(
                allowed=False,
                reason=FilterReason.NO_INCLUDE_MATCH,
                normalized_url=normalized,
                details=reason,
            )

    # URL is allowed
    return FilterResult(
        allowed=True,
        reason=FilterReason.ALLOWED,
        normalized_url=normalized,
    )
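
As the listing shows, checks run in a fixed order: URL validity, scheme, deduplication, depth, domain, path prefix, extension, then patterns. One consequence is that the depth check fires before the domain check:

Python
from ragcrawl.filters import LinkFilter, FilterReason

lf = LinkFilter(allowed_domains=["docs.example.com"])
result = lf.filter("https://other.example.net/deep", current_depth=3, max_depth=2)
# Reported as MAX_DEPTH_EXCEEDED, not DOMAIN_NOT_ALLOWED
assert result.reason == FilterReason.MAX_DEPTH_EXCEEDED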

mark_seen

Python
mark_seen(url: str) -> str

Mark a URL as seen and return normalized form.

PARAMETER  TYPE  DESCRIPTION
url        str   The URL to mark.

RETURNS (str): The normalized URL.

Source code in src/ragcrawl/filters/link_filter.py
Python
def mark_seen(self, url: str) -> str:
    """
    Mark a URL as seen and return normalized form.

    Args:
        url: The URL to mark.

    Returns:
        The normalized URL.
    """
    normalized = self.normalizer.normalize(url)
    self._seen_urls.add(normalized)
    return normalized

is_seen

Python
is_seen(url: str) -> bool

Check if URL has been seen.

PARAMETER  TYPE  DESCRIPTION
url        str   The URL to check.

RETURNS (bool): True if URL has been seen.

Source code in src/ragcrawl/filters/link_filter.py
Python
def is_seen(self, url: str) -> bool:
    """
    Check if URL has been seen.

    Args:
        url: The URL to check.

    Returns:
        True if URL has been seen.
    """
    normalized = self.normalizer.normalize(url)
    return normalized in self._seen_urls

clear_seen

Python
clear_seen() -> None

Clear the set of seen URLs.

Source code in src/ragcrawl/filters/link_filter.py
Python
def clear_seen(self) -> None:
    """Clear the set of seen URLs."""
    self._seen_urls.clear()

PatternMatcher

Python
PatternMatcher(
    include_patterns: list[str] | None = None,
    exclude_patterns: list[str] | None = None,
    case_sensitive: bool = False,
)

Matches URLs against include/exclude patterns.

Supports both regex and glob patterns.

Initialize the pattern matcher.

PARAMETER         TYPE              DEFAULT  DESCRIPTION
include_patterns  list[str] | None  None     Patterns for URLs to include (regex or glob).
exclude_patterns  list[str] | None  None     Patterns for URLs to exclude (regex or glob).
case_sensitive    bool              False    Whether pattern matching is case-sensitive.

Source code in src/ragcrawl/filters/patterns.py
Python
def __init__(
    self,
    include_patterns: list[str] | None = None,
    exclude_patterns: list[str] | None = None,
    case_sensitive: bool = False,
) -> None:
    """
    Initialize the pattern matcher.

    Args:
        include_patterns: Patterns for URLs to include (regex or glob).
        exclude_patterns: Patterns for URLs to exclude (regex or glob).
        case_sensitive: Whether pattern matching is case-sensitive.
    """
    self.case_sensitive = case_sensitive
    self._include_patterns = self._compile_patterns(include_patterns or [])
    self._exclude_patterns = self._compile_patterns(exclude_patterns or [])

matches_include

Python
matches_include(url: str) -> bool

Check if URL matches any include pattern.

PARAMETER  TYPE  DESCRIPTION
url        str   The URL to check.

RETURNS (bool): True if URL matches an include pattern or no include patterns are defined.

Source code in src/ragcrawl/filters/patterns.py
Python
def matches_include(self, url: str) -> bool:
    """
    Check if URL matches any include pattern.

    Args:
        url: The URL to check.

    Returns:
        True if URL matches an include pattern or no include patterns defined.
    """
    if not self._include_patterns:
        return True

    return any(p.search(url) for p in self._include_patterns)

matches_exclude

Python
matches_exclude(url: str) -> bool

Check if URL matches any exclude pattern.

PARAMETER  TYPE  DESCRIPTION
url        str   The URL to check.

RETURNS (bool): True if URL matches an exclude pattern.

Source code in src/ragcrawl/filters/patterns.py
Python
def matches_exclude(self, url: str) -> bool:
    """
    Check if URL matches any exclude pattern.

    Args:
        url: The URL to check.

    Returns:
        True if URL matches an exclude pattern.
    """
    if not self._exclude_patterns:
        return False

    return any(p.search(url) for p in self._exclude_patterns)

should_include

Python
should_include(url: str) -> bool

Determine if URL should be included based on patterns.

Exclude patterns take precedence over include patterns.

PARAMETER  TYPE  DESCRIPTION
url        str   The URL to check.

RETURNS (bool): True if URL should be included.

Source code in src/ragcrawl/filters/patterns.py
Python
def should_include(self, url: str) -> bool:
    """
    Determine if URL should be included based on patterns.

    Exclude patterns take precedence over include patterns.

    Args:
        url: The URL to check.

    Returns:
        True if URL should be included.
    """
    # Exclude takes precedence
    if self.matches_exclude(url):
        return False

    return self.matches_include(url)
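
Because exclude patterns win, a URL matching both lists is rejected; a small sketch:

Python
from ragcrawl.filters import PatternMatcher

m = PatternMatcher(include_patterns=["/docs/*"], exclude_patterns=["*draft*"])
assert m.should_include("/docs/intro")
assert not m.should_include("/docs/draft-plan")  # exclude takes precedence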

get_match_reason

Python
get_match_reason(url: str) -> str | None

Get the reason for inclusion/exclusion.

PARAMETER  TYPE  DESCRIPTION
url        str   The URL to check.

RETURNS (str | None): A string describing the match, or None if included by default.

Source code in src/ragcrawl/filters/patterns.py
Python
def get_match_reason(self, url: str) -> str | None:
    """
    Get the reason for inclusion/exclusion.

    Args:
        url: The URL to check.

    Returns:
        A string describing the match, or None if included by default.
    """
    for pattern in self._exclude_patterns:
        if pattern.search(url):
            return f"excluded by pattern: {pattern.pattern}"

    if self._include_patterns:
        for pattern in self._include_patterns:
            if pattern.search(url):
                return f"included by pattern: {pattern.pattern}"

        return "no include pattern matched"

    return None

URLNormalizer

Python
URLNormalizer(
    remove_fragments: bool = True,
    normalize_trailing_slash: bool = True,
    sort_query_params: bool = True,
    remove_query_params: list[str] | None = None,
    lowercase_hostname: bool = True,
    remove_default_ports: bool = True,
    remove_www: bool = False,
)

Normalizes URLs for deterministic deduplication.

Handles:

- Fragment removal
- Trailing slash normalization
- Query parameter sorting and filtering
- Scheme normalization
- Case normalization for hostname
- Path normalization

Initialize the URL normalizer.

PARAMETER                 TYPE              DEFAULT  DESCRIPTION
remove_fragments          bool              True     Remove URL fragments (#...).
normalize_trailing_slash  bool              True     Ensure consistent trailing slash handling.
sort_query_params         bool              True     Sort query parameters alphabetically.
remove_query_params       list[str] | None  None     List of query params to remove (e.g., tracking params).
lowercase_hostname        bool              True     Lowercase the hostname.
remove_default_ports      bool              True     Remove default ports (80, 443).
remove_www                bool              False    Remove www. prefix from hostname.

Source code in src/ragcrawl/filters/url_normalizer.py
Python
def __init__(
    self,
    remove_fragments: bool = True,
    normalize_trailing_slash: bool = True,
    sort_query_params: bool = True,
    remove_query_params: list[str] | None = None,
    lowercase_hostname: bool = True,
    remove_default_ports: bool = True,
    remove_www: bool = False,
) -> None:
    """
    Initialize the URL normalizer.

    Args:
        remove_fragments: Remove URL fragments (#...).
        normalize_trailing_slash: Ensure consistent trailing slash handling.
        sort_query_params: Sort query parameters alphabetically.
        remove_query_params: List of query params to remove (e.g., tracking params).
        lowercase_hostname: Lowercase the hostname.
        remove_default_ports: Remove default ports (80, 443).
        remove_www: Remove www. prefix from hostname.
    """
    self.remove_fragments = remove_fragments
    self.normalize_trailing_slash = normalize_trailing_slash
    self.sort_query_params = sort_query_params
    self.remove_query_params = set(remove_query_params or [])
    self.lowercase_hostname = lowercase_hostname
    self.remove_default_ports = remove_default_ports
    self.remove_www = remove_www

    # Default tracking params to remove
    self.default_tracking_params = {
        "utm_source",
        "utm_medium",
        "utm_campaign",
        "utm_term",
        "utm_content",
        "fbclid",
        "gclid",
        "ref",
        "source",
    }

normalize

Python
normalize(url: str) -> str

Normalize a URL for deduplication.

PARAMETER  TYPE  DESCRIPTION
url        str   The URL to normalize.

RETURNS (str): The normalized URL string.

Source code in src/ragcrawl/filters/url_normalizer.py
Python
def normalize(self, url: str) -> str:
    """
    Normalize a URL for deduplication.

    Args:
        url: The URL to normalize.

    Returns:
        The normalized URL string.
    """
    try:
        parsed = urlparse(url)
    except Exception:
        return url

    # Scheme normalization (lowercase)
    scheme = parsed.scheme.lower()

    # Hostname normalization
    hostname = parsed.netloc
    if self.lowercase_hostname:
        hostname = hostname.lower()

    # Remove default ports
    if self.remove_default_ports:
        if scheme == "http" and hostname.endswith(":80"):
            hostname = hostname[:-3]
        elif scheme == "https" and hostname.endswith(":443"):
            hostname = hostname[:-4]

    # Remove www prefix
    if self.remove_www and hostname.startswith("www."):
        hostname = hostname[4:]

    # Path normalization
    path = parsed.path

    # Remove duplicate slashes
    path = re.sub(r"/+", "/", path)

    # Normalize path encoding
    # Decode safe characters that don't need encoding
    path = path.replace("%7E", "~")

    # Handle trailing slash
    if self.normalize_trailing_slash:
        # Keep trailing slash only for directories (no extension)
        if path and not path.endswith("/"):
            # Check if it looks like a file (has extension)
            last_segment = path.split("/")[-1]
            if "." not in last_segment and path != "/":
                # It's a directory-like path, could add trailing slash
                # But for consistency, we'll remove trailing slashes
                pass
        # Remove trailing slash except for root
        if path != "/" and path.endswith("/"):
            path = path.rstrip("/")

    # Empty path becomes /
    if not path:
        path = "/"

    # Query parameter normalization
    query = parsed.query
    if query:
        params = parse_qs(query, keep_blank_values=True)

        # Remove tracking and specified params
        params_to_remove = self.remove_query_params | self.default_tracking_params
        params = {k: v for k, v in params.items() if k not in params_to_remove}

        # Sort and rebuild query string
        if self.sort_query_params:
            sorted_params = sorted(params.items())
            # Flatten multi-value params
            flat_params = []
            for k, values in sorted_params:
                for v in sorted(values):
                    flat_params.append((k, v))
            query = urlencode(flat_params)
        else:
            query = urlencode(params, doseq=True)
    else:
        query = ""

    # Fragment handling
    fragment = "" if self.remove_fragments else parsed.fragment

    # Rebuild URL
    normalized = urlunparse((scheme, hostname, path, "", query, fragment))

    return normalized
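
A few behaviors visible in the listing: duplicate slashes collapse, trailing slashes are stripped everywhere except the root, and an empty path becomes "/". For example:

Python
n = URLNormalizer()
print(n.normalize("https://example.com//docs//intro/"))
# https://example.com/docs/intro
print(n.normalize("https://example.com"))
# https://example.com/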

get_domain

Python
get_domain(url: str) -> str

Extract the domain from a URL.

PARAMETER  TYPE  DESCRIPTION
url        str   The URL.

RETURNS (str): The domain (e.g., 'example.com').

Source code in src/ragcrawl/filters/url_normalizer.py
Python
def get_domain(self, url: str) -> str:
    """
    Extract the domain from a URL.

    Args:
        url: The URL.

    Returns:
        The domain (e.g., 'example.com').
    """
    try:
        parsed = urlparse(url)
        hostname = parsed.netloc.lower()

        # Remove port
        if ":" in hostname:
            hostname = hostname.split(":")[0]

        return hostname
    except Exception:
        return ""

get_registered_domain

Python
get_registered_domain(url: str) -> str

Extract the registered domain (eTLD+1) from a URL.

PARAMETER  TYPE  DESCRIPTION
url        str   The URL.

RETURNS (str): The registered domain (e.g., 'example.com' for 'sub.example.com').

Source code in src/ragcrawl/filters/url_normalizer.py
Python
def get_registered_domain(self, url: str) -> str:
    """
    Extract the registered domain (eTLD+1) from a URL.

    Args:
        url: The URL.

    Returns:
        The registered domain (e.g., 'example.com' for 'sub.example.com').
    """
    try:
        extracted = tldextract.extract(url)
        if extracted.domain and extracted.suffix:
            return f"{extracted.domain}.{extracted.suffix}"
        return extracted.domain or ""
    except Exception:
        return ""

is_same_domain

Python
is_same_domain(url1: str, url2: str) -> bool

Check if two URLs are on the same domain.

PARAMETER  TYPE  DESCRIPTION
url1       str   First URL.
url2       str   Second URL.

RETURNS (bool): True if same domain.

Source code in src/ragcrawl/filters/url_normalizer.py
Python
def is_same_domain(self, url1: str, url2: str) -> bool:
    """
    Check if two URLs are on the same domain.

    Args:
        url1: First URL.
        url2: Second URL.

    Returns:
        True if same domain.
    """
    return self.get_domain(url1) == self.get_domain(url2)
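
Note that is_same_domain compares exact hostnames; to treat subdomains as the same site, use is_same_registered_domain (below):

Python
n = URLNormalizer()
assert not n.is_same_domain("https://docs.example.com", "https://api.example.com")
assert n.is_same_registered_domain("https://docs.example.com", "https://api.example.com")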

is_same_registered_domain

Python
is_same_registered_domain(url1: str, url2: str) -> bool

Check if two URLs are on the same registered domain.

This considers subdomains as the same domain.

PARAMETER  TYPE  DESCRIPTION
url1       str   First URL.
url2       str   Second URL.

RETURNS (bool): True if same registered domain.

Source code in src/ragcrawl/filters/url_normalizer.py
Python
def is_same_registered_domain(self, url1: str, url2: str) -> bool:
    """
    Check if two URLs are on the same registered domain.

    This considers subdomains as the same domain.

    Args:
        url1: First URL.
        url2: Second URL.

    Returns:
        True if same registered domain.
    """
    return self.get_registered_domain(url1) == self.get_registered_domain(url2)