# Filters API
ragcrawl provides URL filtering utilities to control which pages are crawled.
## Overview
| Filter | Description |
|---|---|
| `LinkFilter` | Complete URL filtering with domains, patterns, and deduplication |
| `PatternMatcher` | Glob/regex pattern matching |
| `URLNormalizer` | URL normalization and hashing |
| `ExtensionFilter` | File extension filtering |
## LinkFilter
The main filter class combining all filtering logic.
```python
from ragcrawl.filters import LinkFilter, FilterReason

link_filter = LinkFilter(
    allowed_domains=["docs.example.com"],
    allow_subdomains=True,
    include_patterns=["/docs/*", "/api/*"],
    exclude_patterns=["/admin/*", "*secret*"],
    blocked_extensions=[".pdf", ".zip", ".png"],
)

# Check if a URL should be crawled
url = "https://docs.example.com/api/users"
result = link_filter.filter(url)
if result.allowed:
    crawl_url(url)  # crawl_url is your own fetch function
else:
    print(f"Filtered: {result.reason}")
    # Possible reasons:
    # FilterReason.DOMAIN_NOT_ALLOWED
    # FilterReason.EXCLUDED_PATTERN
    # FilterReason.NO_INCLUDE_MATCH
    # FilterReason.BLOCKED_EXTENSION
    # FilterReason.ALREADY_SEEN
```
### Deduplication
```python
# Track seen URLs
link_filter.mark_seen("https://example.com/page1")

# Check with deduplication
result = link_filter.filter("https://example.com/page1", check_seen=True)
if not result.allowed:
    print(f"Already seen: {result.reason == FilterReason.ALREADY_SEEN}")
```
### Configuration
| Option | Type | Description |
|---|---|---|
| `allowed_domains` | `list[str]` | Domains to allow (empty = all) |
| `allow_subdomains` | `bool` | Include subdomains of allowed domains |
| `allowed_schemes` | `list[str]` | URL schemes to allow (default: http, https) |
| `allowed_path_prefixes` | `list[str]` | Path prefixes to allow (empty = all) |
| `include_patterns` | `list[str]` | URL patterns to include |
| `exclude_patterns` | `list[str]` | URL patterns to exclude |
| `blocked_extensions` | `list[str]` | File extensions to skip |
| `blocked_query_params` | `list[str]` | Query parameters to strip |
## PatternMatcher
Match URLs against glob or regex patterns.
```python
from ragcrawl.filters import PatternMatcher

matcher = PatternMatcher(
    include_patterns=["/docs/*", "/api/v1/*"],
    exclude_patterns=["*internal*", "*private*"],
    case_sensitive=False,
)

# Check if a URL path should be included
if matcher.should_include("/docs/getting-started"):
    print("URL matches an include pattern")

if not matcher.should_include("/admin/settings"):
    print("URL doesn't match any include pattern")
```
### Pattern Syntax
| Pattern | Matches |
|---|---|
| `/docs/*` | `/docs/anything` |
| `**/api/*` | `any/path/api/anything` |
| `*.pdf` | Files ending in `.pdf` |
| `/api/v[12]/*` | `/api/v1/` or `/api/v2/` paths |
Patterns support both glob syntax (`*`, `**`, `?`) and regular expressions; a pattern containing regex metacharacters such as `|`, `^`, or `$` is treated as a regex. A short sketch of both styles follows.
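A minimal sketch of the two styles (whether an anchored regex uses search or full-match semantics isn't specified in this reference, so the regex results below are assumptions):

```python
from ragcrawl.filters import PatternMatcher

# Glob style: * and ** wildcards
glob_matcher = PatternMatcher(include_patterns=["/docs/*", "**/api/*"])
print(glob_matcher.should_include("/docs/intro"))     # True

# Regex style: metacharacters like ^ mark the pattern as a regex
regex_matcher = PatternMatcher(include_patterns=[r"^/api/v[12]/"])
print(regex_matcher.should_include("/api/v2/users"))  # True (assumes search semantics)
print(regex_matcher.should_include("/api/v3/users"))  # False
```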
## URLNormalizer
Normalize URLs for consistent comparison and hashing.
```python
from ragcrawl.filters import URLNormalizer

normalizer = URLNormalizer(
    remove_fragments=True,
    remove_query_params=["utm_source", "utm_medium", "utm_campaign"],
    sort_query_params=True,
)

# Normalize a URL
normalized = normalizer.normalize(
    "HTTPS://Example.COM/Page?utm_source=google&id=1#section"
)
# Result: "https://example.com/Page?id=1"

# Get the domain
domain = normalizer.get_domain("https://docs.example.com/page")
# Result: "docs.example.com"

# Get the registered domain (eTLD+1)
base = normalizer.get_registered_domain("https://docs.example.com/page")
# Result: "example.com"

# Compare exact domains
same = normalizer.is_same_domain(
    "https://docs.example.com",
    "https://api.example.com",
)
# Result: False

# Compare registered domains (subdomains count as the same site)
same_site = normalizer.is_same_registered_domain(
    "https://docs.example.com",
    "https://api.example.com",
)
# Result: True
```
### Configuration
| Option | Type | Default | Description |
|---|---|---|---|
| `remove_fragments` | `bool` | `True` | Remove URL fragments (`#...`) |
| `normalize_trailing_slash` | `bool` | `True` | Ensure consistent trailing-slash handling |
| `sort_query_params` | `bool` | `True` | Sort query parameters alphabetically |
| `remove_query_params` | `list[str]` | `None` | Query params to remove (e.g., tracking params) |
| `lowercase_hostname` | `bool` | `True` | Lowercase the hostname |
| `remove_default_ports` | `bool` | `True` | Remove default ports (80, 443) |
| `remove_www` | `bool` | `False` | Remove `www.` prefix from hostname |
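As a quick illustration of combining these options (a minimal sketch; the exact output string is an assumption based on the defaults above):

```python
from ragcrawl.filters import URLNormalizer

normalizer = URLNormalizer(remove_www=True)

# Default port and www. prefix stripped, hostname lowercased
print(normalizer.normalize("https://WWW.Example.com:443/docs"))
# Expected (assumption): "https://example.com/docs"
```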
## ExtensionFilter
Filter URLs by file extension.
```python
from ragcrawl.filters import ExtensionFilter

ext_filter = ExtensionFilter(
    blocked_extensions=[".pdf", ".png", ".jpg", ".zip", ".exe"]
)

# Check if a URL is blocked
if ext_filter.is_blocked("https://example.com/report.pdf"):
    print("PDF files are blocked")

# Get the file extension
ext = ext_filter.get_extension("https://example.com/file.tar.gz")
# Result: ".gz"
```
## Integration Example
```python
from ragcrawl.filters import LinkFilter

# Create a comprehensive filter
link_filter = LinkFilter(
    allowed_domains=["docs.python.org"],
    allow_subdomains=False,
    include_patterns=[
        "/3/*",  # Python 3 docs only
    ],
    exclude_patterns=[
        "*/whatsnew/*",  # Skip what's-new pages
        "*/_sources/*",  # Skip source files
    ],
    blocked_extensions=[
        ".pdf", ".zip", ".tar.gz",
        ".png", ".jpg", ".gif", ".svg",
    ],
)

# Use while crawling (discovered_urls and queue come from your crawler)
for url in discovered_urls:
    result = link_filter.filter(url, check_seen=True)
    if result.allowed:
        link_filter.mark_seen(url)
        queue.add(url)
```
## Module Reference
URL filtering and normalization for ragcrawl.
### LinkFilter
```python
LinkFilter(
    allowed_domains: list[str] | None = None,
    allow_subdomains: bool = True,
    allowed_schemes: list[str] | None = None,
    allowed_path_prefixes: list[str] | None = None,
    blocked_extensions: list[str] | None = None,
    include_patterns: list[str] | None = None,
    exclude_patterns: list[str] | None = None,
    blocked_query_params: list[str] | None = None,
)
```
Filters URLs based on domain, path, extension, and pattern constraints.
This is the main filter used by the crawler to determine which URLs to include in the frontier.
Initialize the link filter.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `allowed_domains` | `list[str] \| None` | Domains to allow (empty = all). |
| `allow_subdomains` | `bool` | Whether to allow subdomains of `allowed_domains`. |
| `allowed_schemes` | `list[str] \| None` | URL schemes to allow (default: http, https). |
| `allowed_path_prefixes` | `list[str] \| None` | Path prefixes to allow (empty = all). |
| `blocked_extensions` | `list[str] \| None` | File extensions to block. |
| `include_patterns` | `list[str] \| None` | Regex/glob patterns for URLs to include. |
| `exclude_patterns` | `list[str] \| None` | Regex/glob patterns for URLs to exclude. |
| `blocked_query_params` | `list[str] \| None` | Query parameters to strip. |
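The constructor options not exercised in the examples above work the same way; a minimal sketch with illustrative values (the specific schemes, prefixes, and parameter names here are assumptions, not recommendations):

```python
from ragcrawl.filters import LinkFilter

link_filter = LinkFilter(
    allowed_schemes=["https"],                # reject plain-http links
    allowed_path_prefixes=["/docs", "/api"],  # only these path prefixes
    blocked_query_params=["sessionid"],       # strip this query parameter
)
```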
#### filter
```python
filter(
    url: str,
    check_seen: bool = True,
    current_depth: int = 0,
    max_depth: int | None = None,
) -> FilterResult
```
Filter a URL and return the result.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | The URL to filter. |
| `check_seen` | `bool` | Whether to check if URL was already seen. |
| `current_depth` | `int` | Current crawl depth. |
| `max_depth` | `int \| None` | Maximum allowed depth. |
| RETURNS | DESCRIPTION |
|---|---|
| `FilterResult` | `FilterResult` with allowed status and reason. |
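Depth limits can be enforced at filter time as well (a brief sketch; the reference above doesn't name the `FilterReason` used for depth violations, so only `result.allowed` is asserted here):

```python
from ragcrawl.filters import LinkFilter

link_filter = LinkFilter(allowed_domains=["docs.example.com"])
result = link_filter.filter(
    "https://docs.example.com/deep/page",
    current_depth=5,
    max_depth=3,
)
print(result.allowed)  # False: current_depth exceeds max_depth
```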
#### mark_seen
Mark a URL as seen and return normalized form.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | The URL to mark. |
| RETURNS | DESCRIPTION |
|---|---|
| `str` | The normalized URL. |
#### is_seen
Check if URL has been seen.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | The URL to check. |
| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if URL has been seen. |
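Together, `mark_seen` and `is_seen` form the deduplication round trip (a minimal sketch; it assumes both URLs normalize to the same form under the default normalizer):

```python
from ragcrawl.filters import LinkFilter

link_filter = LinkFilter()
normalized = link_filter.mark_seen("https://example.com/page#intro")
print(normalized)  # the normalized URL, e.g. with the fragment stripped

print(link_filter.is_seen("https://example.com/page"))  # True, assuming identical normalized forms
```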
### PatternMatcher
```python
PatternMatcher(
    include_patterns: list[str] | None = None,
    exclude_patterns: list[str] | None = None,
    case_sensitive: bool = False,
)
```
Matches URLs against include/exclude patterns.
Supports both regex and glob patterns.
Initialize the pattern matcher.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `include_patterns` | `list[str] \| None` | Patterns for URLs to include (regex or glob). |
| `exclude_patterns` | `list[str] \| None` | Patterns for URLs to exclude (regex or glob). |
| `case_sensitive` | `bool` | Whether pattern matching is case-sensitive. |
#### matches_include
Check if URL matches any include pattern.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | The URL to check. |
| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if URL matches an include pattern or no include patterns are defined. |
#### matches_exclude
Check if URL matches any exclude pattern.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | The URL to check. |
| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if URL matches an exclude pattern. |
#### should_include
Determine if URL should be included based on patterns.
Exclude patterns take precedence over include patterns.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | The URL to check. |
| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if URL should be included. |
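Because exclude patterns take precedence, a URL matching both lists is rejected; for example:

```python
from ragcrawl.filters import PatternMatcher

matcher = PatternMatcher(
    include_patterns=["/docs/*"],
    exclude_patterns=["*internal*"],
)
print(matcher.should_include("/docs/internal-notes"))  # False: the exclude pattern wins
print(matcher.should_include("/docs/intro"))           # True: matches an include pattern
```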
#### get_match_reason
Get the reason for inclusion/exclusion.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | The URL to check. |
| RETURNS | DESCRIPTION |
|---|---|
| `str \| None` | A string describing the match, or None if included by default. |
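This is handy when logging filter decisions (a sketch; the exact wording of the returned string is implementation-defined):

```python
from ragcrawl.filters import PatternMatcher

matcher = PatternMatcher(exclude_patterns=["*internal*"])
reason = matcher.get_match_reason("/docs/internal-notes")
if reason is not None:
    print(f"Pattern decision: {reason}")  # e.g. names the exclude pattern that matched
```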
### URLNormalizer
```python
URLNormalizer(
    remove_fragments: bool = True,
    normalize_trailing_slash: bool = True,
    sort_query_params: bool = True,
    remove_query_params: list[str] | None = None,
    lowercase_hostname: bool = True,
    remove_default_ports: bool = True,
    remove_www: bool = False,
)
```
Normalizes URLs for deterministic deduplication.
Handles:

- Fragment removal
- Trailing slash normalization
- Query parameter sorting and filtering
- Scheme normalization
- Case normalization for hostname
- Path normalization
Initialize the URL normalizer.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `remove_fragments` | `bool` | Remove URL fragments (`#...`). |
| `normalize_trailing_slash` | `bool` | Ensure consistent trailing slash handling. |
| `sort_query_params` | `bool` | Sort query parameters alphabetically. |
| `remove_query_params` | `list[str] \| None` | List of query params to remove (e.g., tracking params). |
| `lowercase_hostname` | `bool` | Lowercase the hostname. |
| `remove_default_ports` | `bool` | Remove default ports (80, 443). |
| `remove_www` | `bool` | Remove `www.` prefix from hostname. |
#### normalize
Normalize a URL for deduplication.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | The URL to normalize. |
| RETURNS | DESCRIPTION |
|---|---|
| `str` | The normalized URL string. |
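With the defaults, normalization covers the cases listed for the class above (a brief sketch; the exact output, such as trailing-slash handling, is an assumption):

```python
from ragcrawl.filters import URLNormalizer

normalizer = URLNormalizer()
print(normalizer.normalize("https://Example.com:443/a/b?z=2&a=1#top"))
# Expected (assumption): "https://example.com/a/b?a=1&z=2"
# Default port removed, hostname lowercased, params sorted, fragment stripped.
```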
#### get_domain
Extract the domain from a URL.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | The URL. |
| RETURNS | DESCRIPTION |
|---|---|
| `str` | The domain (e.g., 'example.com'). |
#### get_registered_domain
Extract the registered domain (eTLD+1) from a URL.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url` | `str` | The URL. |
| RETURNS | DESCRIPTION |
|---|---|
| `str` | The registered domain (e.g., 'example.com' for 'sub.example.com'). |
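Registered-domain extraction matters most for multi-label public suffixes (a sketch that assumes the implementation consults a public-suffix list; if it simply takes the last two labels, the result below would differ):

```python
from ragcrawl.filters import URLNormalizer

normalizer = URLNormalizer()
print(normalizer.get_registered_domain("https://docs.example.co.uk/page"))
# Expected with public-suffix handling (assumption): "example.co.uk"
```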
#### is_same_domain
Check if two URLs are on the same domain.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url1` | `str` | First URL. |
| `url2` | `str` | Second URL. |
| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if same domain. |
#### is_same_registered_domain
Check if two URLs are on the same registered domain.
This considers subdomains as the same domain.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `url1` | `str` | First URL. |
| `url2` | `str` | Second URL. |
| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if same registered domain. |
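The two comparisons differ exactly on subdomains:

```python
from ragcrawl.filters import URLNormalizer

normalizer = URLNormalizer()
a = "https://docs.example.com"
b = "https://api.example.com"

print(normalizer.is_same_domain(a, b))             # False: hostnames differ
print(normalizer.is_same_registered_domain(a, b))  # True: both are under example.com
```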