policy

This module defines the Policy class and its subclasses, which together determine how the web crawler behaves.

Classes

Policy

class dysdera.policy.Policy(focus_policy: typing.Callable[[dysdera.web.WebTarget], typing.Awaitable[bool]] = <function default_true>, sitemap_scheduling_cost: typing.Callable[[typing.Dict[str, str | bool]], int] = <function Policy.<lambda>>, scheduling_cost: typing.Callable[[dysdera.web.WebTarget], int] = <function Policy.<lambda>>, sitemap_selection_policy: typing.Callable[[typing.Dict[str, str | bool]], bool] = <function Policy.<lambda>>, selection_policy: typing.Callable[[dysdera.web.WebTarget], typing.Awaitable[bool]] = <function default_true>, headers_before_visit: typing.Callable[[dysdera.web.WebTarget], typing.Awaitable[bool]] = <function default_false>, respect_robots=True, agent_name=None, canonical_url=True, default_delay: float = 5, can_dload_without_ssl: typing.Callable[[dysdera.web.WebPage], bool] = <function Policy.<lambda>>, visit_sitemap: typing.Callable[[dysdera.parser.URL], bool] = <function Policy.<lambda>>, dload_if_modified_since=<function unknown_last_modify>)

The Policy class defines the behavior of the web crawler. It contains parameters and methods for deciding which pages to visit and crawl. You can implement a custom Policy by defining a subclass of Policy.

Parameters:
  • focus_policy (async (dysdera.web.WebTarget) to [bool]) – Coroutine function to determine if links on a page should be visited. Default is to visit all links.

  • sitemap_scheduling_cost ((Dict[str, str | bool]) to [int]) – Synchronous function that calculates the scheduling cost of an entry found in a sitemap; the entry is passed as a dictionary. The default returns 1.

  • scheduling_cost ((dysdera.web.WebTarget) to [int]) – Synchronous function that calculates the scheduling cost of an individual page. The default returns 1.

  • sitemap_selection_policy ((Dict[str, str | bool]) to [bool]) – Synchronous function that determines if a sitemap entry should be visited; the entry is passed as a dictionary. Default is to visit all sitemap pages.

  • selection_policy (async (dysdera.web.WebTarget) to [bool]) – Coroutine function to determine if a page should be visited. Default is to visit all pages.

  • headers_before_visit (async (dysdera.web.WebTarget) to [bool]) – Coroutine function to determine if headers should be fetched before visiting a page. Default is not to fetch headers.

  • respect_robots (bool) – Boolean indicating whether to respect robots.txt rules. Default is True.

  • agent_name (str or None) – User agent name for robots.txt. Default is None.

  • default_delay (float) – Delay to use when robots.txt does not specify one. Default is 5.

  • can_dload_without_ssl ((dysdera.web.WebPage) to [bool]) – Synchronous function that determines if a page can be downloaded without SSL. By default this is not allowed.

  • dload_if_modified_since (callable) – Function used to decide whether a page should be downloaded based on its last modification date. Default is unknown_last_modify.
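
The callables accepted by Policy come in two shapes: focus_policy, selection_policy, and headers_before_visit are coroutine functions, while the cost and sitemap functions are plain synchronous callables. A minimal sketch of matching callables follows. StubTarget, its url attribute, and the "loc" sitemap key are assumptions made for illustration; the real WebTarget attributes and sitemap dictionary keys are not described in this section.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, Union
from urllib.parse import urlparse


@dataclass
class StubTarget:
    """Hypothetical stand-in for dysdera.web.WebTarget (attribute name assumed)."""
    url: str


async def only_https(target: StubTarget) -> bool:
    """Coroutine in the shape of selection_policy: accept HTTPS URLs only."""
    return urlparse(target.url).scheme == "https"


def path_depth_cost(target: StubTarget) -> int:
    """Synchronous callable in the shape of scheduling_cost.

    Returns 1 plus the number of non-empty path segments, so deeper
    pages get a larger value.
    """
    segments = [s for s in urlparse(target.url).path.split("/") if s]
    return 1 + len(segments)


def skip_pdf_entries(entry: Dict[str, Union[str, bool]]) -> bool:
    """Synchronous callable in the shape of sitemap_selection_policy.

    The "loc" key is an assumption about the sitemap dictionary layout.
    """
    return not str(entry.get("loc", "")).lower().endswith(".pdf")


if __name__ == "__main__":
    page = StubTarget("https://example.com/docs/api/index.html")
    print(asyncio.run(only_https(page)))
    print(path_depth_cost(page))
    print(skip_pdf_entries({"loc": "https://example.com/a.pdf"}))
```

With dysdera installed, callables of this shape would be passed by keyword, for example Policy(selection_policy=only_https, scheduling_cost=path_depth_cost, sitemap_selection_policy=skip_pdf_entries). How the crawler interprets larger or smaller cost values is not stated here.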

Methods:

  • should_visit: Determines if a page should be visited based on the selection policy.

  • queue_weight: Calculates the weight of a page in the queue.

  • map_queue_weight: Calculates the weight of pages in a map queue.

  • should_crawl: Determines if a page should be crawled based on the focus policy.

DomainPolicy

class dysdera.policy.DomainPolicy(*domains: str, sitemap_scheduling_cost: typing.Callable[[typing.Dict[str, str | bool]], int] = <function DomainPolicy.<lambda>>, scheduling_cost: typing.Callable[[dysdera.web.WebTarget], int] = <function DomainPolicy.<lambda>>)

The DomainPolicy class extends the Policy class to define policies for pages within specified domains.

Parameters:
  • domains (str, variadic) – One or more domains to which the policy applies, passed as separate positional arguments.

  • sitemap_scheduling_cost ((Dict[str, str | bool]) to [int]) – Synchronous function that calculates the scheduling cost of an entry found in a sitemap; the entry is passed as a dictionary. The default returns 1.

  • scheduling_cost ((dysdera.web.WebTarget) to [int]) – Synchronous function that calculates the scheduling cost of an individual page. The default returns 1.

async url_same_domain(x: WebTarget) → bool
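
To illustrate the kind of test a domain restriction implies, here is a minimal standalone sketch of a same-domain predicate built on urllib.parse. It is not dysdera's implementation of url_same_domain: make_same_domain_check is a hypothetical helper, it works on plain URL strings rather than WebTarget objects, and treating subdomains as part of a domain is an assumption.

```python
from urllib.parse import urlparse


def make_same_domain_check(*domains: str):
    """Build a predicate accepting URLs whose host is one of `domains`
    or a subdomain of one (the subdomain rule is an assumption)."""
    allowed = tuple(d.lower().lstrip(".") for d in domains)

    def same_domain(url: str) -> bool:
        # hostname is None for URLs without a network location (e.g. mailto:)
        host = (urlparse(url).hostname or "").lower()
        return any(host == d or host.endswith("." + d) for d in allowed)

    return same_domain


if __name__ == "__main__":
    check = make_same_domain_check("example.com")
    print(check("https://blog.example.com/post"))
    print(check("https://notexample.com/"))
```

Matching on "." + domain rather than a bare suffix keeps hosts such as notexample.com from passing as example.com. Following the signature above, the policy itself would be created with the domains as positional arguments, for example DomainPolicy("example.com", "example.org").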

ExtendedDomainPolicy

class dysdera.policy.ExtendedDomainPolicy(*domains: str, sitemap_scheduling_cost: typing.Callable[[typing.Dict[str, str | bool]], int] = <function ExtendedDomainPolicy.<lambda>>, scheduling_cost: typing.Callable[[dysdera.web.WebTarget], int] = <function ExtendedDomainPolicy.<lambda>>)

The ExtendedDomainPolicy class extends the Policy class to define policies for pages within specified domains and the pages they point to.

Parameters:
  • domains (str, variadic) – One or more domains to which the policy applies, passed as separate positional arguments.

  • sitemap_scheduling_cost ((Dict[str, str | bool]) to [int]) – Synchronous function that calculates the scheduling cost of an entry found in a sitemap; the entry is passed as a dictionary. The default returns 1.

  • scheduling_cost ((dysdera.web.WebTarget) to [int]) – Synchronous function that calculates the scheduling cost of an individual page. The default returns 1.

async page_same_domain(x: WebTarget) → bool
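
The difference from a strict domain restriction is that pages linked from an in-domain page are also accepted, even when they live elsewhere. The sketch below shows one way such a rule could be expressed; it is an illustration under stated assumptions, not the library's code. StubLink, its found_on field, and the subdomain matching rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
from urllib.parse import urlparse


@dataclass
class StubLink:
    """Hypothetical stand-in: a URL plus the page it was discovered on."""
    url: str
    found_on: Optional[str] = None  # None for seed URLs


def in_domains(url: str, domains: Tuple[str, ...]) -> bool:
    """True if the URL's host equals, or is a subdomain of, one of `domains`."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in domains)


def extended_select(link: StubLink, domains: Tuple[str, ...]) -> bool:
    """Accept in-domain URLs, plus URLs linked directly from an in-domain page."""
    if in_domains(link.url, domains):
        return True
    return link.found_on is not None and in_domains(link.found_on, domains)
```

Under this rule the crawl reaches one hop beyond the listed domains and stops there, because links found on an out-of-domain page fail both checks.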

MongoMemoryPolicy

class dysdera.policy.MongoMemoryPolicy(collection: motor.motor_asyncio.AsyncIOMotorCollection, focus_policy: typing.Callable[[dysdera.web.WebTarget], typing.Awaitable[bool]] = <function default_true>, sitemap_scheduling_cost: typing.Callable[[typing.Dict[str, str]], int] = <function MongoMemoryPolicy.<lambda>>, scheduling_cost: typing.Callable[[dysdera.web.WebTarget], int] = <function MongoMemoryPolicy.<lambda>>, sitemap_selection_policy: typing.Callable[[typing.Dict[str, str]], bool] = <function MongoMemoryPolicy.<lambda>>, selection_policy: typing.Callable[[dysdera.web.WebTarget], typing.Awaitable[bool]] = <function default_true>, headers_before_visit: typing.Callable[[dysdera.web.WebTarget], typing.Awaitable[bool]] = <function default_false>, respect_robots=True, agent_name=None, canonical_url=True, default_delay: float = 5, can_dload_without_ssl: typing.Callable[[dysdera.web.WebPage], bool] = <function MongoMemoryPolicy.<lambda>>, visit_sitemap: typing.Callable[[dysdera.parser.URL], bool] = <function MongoMemoryPolicy.<lambda>>, dload_if_modified_since: typing.Callable[[dysdera.parser.URL], datetime.datetime] = <function MongoMemoryPolicy.<lambda>>)

The MongoMemoryPolicy class extends the Policy class to define policies for pages based on data from a MongoDB collection.

Parameters:
  • collection (AsyncIOMotorCollection) – Motor collection containing information about visited pages.

  • focus_policy (async (dysdera.web.WebTarget) to [bool]) – Coroutine function to determine if links on a page should be visited. Default is to visit all links.

  • sitemap_scheduling_cost ((Dict[str, str]) to [int]) – Synchronous function that calculates the scheduling cost of an entry found in a sitemap; the entry is passed as a dictionary. The default returns 1.

  • scheduling_cost ((dysdera.web.WebTarget) to [int]) – Synchronous function that calculates the scheduling cost of an individual page. The default returns 1.

  • sitemap_selection_policy ((Dict[str, str]) to [bool]) – Synchronous function that determines if a sitemap entry should be visited; the entry is passed as a dictionary. Default is to visit all sitemap pages.

  • selection_policy (async (dysdera.web.WebTarget) to [bool]) – Coroutine function to determine if a page should be visited. Default is to visit all pages.

  • headers_before_visit (async (dysdera.web.WebTarget) to [bool]) – Coroutine function to determine if headers should be fetched before visiting a page. Default is not to fetch headers.

  • respect_robots (bool) – Boolean indicating whether to respect robots.txt rules. Default is True.

  • agent_name (str or None) – User agent name for robots.txt. Default is None.

  • default_delay (float) – Delay to use when robots.txt does not specify one. Default is 5.

  • can_dload_without_ssl ((dysdera.web.WebPage) to [bool]) – Synchronous function that determines if a page can be downloaded without SSL. By default this is not allowed.

  • dload_if_modified_since ((dysdera.parser.URL) to [datetime]) – Function used to decide whether a page should be downloaded based on its last modification date; it receives a URL and returns a datetime. The default is a lambda defined by MongoMemoryPolicy.

async was_not_modified(page: URL)
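
A policy backed by a collection of visited pages can skip pages that have not changed since the last visit. The sketch below illustrates that idea only; it is not the body of was_not_modified. StubCollection is an in-memory stand-in that mimics the awaitable find_one of a Motor collection, and the "url" and "visited" field names are assumptions about the stored documents.

```python
import asyncio
from datetime import datetime, timezone
from typing import List, Optional


class StubCollection:
    """In-memory stand-in exposing an async find_one, like a Motor collection."""

    def __init__(self, docs: List[dict]):
        self._docs = docs

    async def find_one(self, query: dict) -> Optional[dict]:
        # Return the first document whose fields equal every query value.
        for doc in self._docs:
            if all(doc.get(k) == v for k, v in query.items()):
                return doc
        return None


async def last_visit(collection, url: str) -> Optional[datetime]:
    """Look up when `url` was last visited (field names are hypothetical)."""
    doc = await collection.find_one({"url": url})
    return doc["visited"] if doc else None


async def needs_download(collection, url: str, last_modified: datetime) -> bool:
    """Download if the page was never seen or changed after the last visit."""
    seen = await last_visit(collection, url)
    return seen is None or last_modified > seen


if __name__ == "__main__":
    coll = StubCollection(
        [{"url": "https://example.com/",
          "visited": datetime(2024, 1, 10, tzinfo=timezone.utc)}]
    )
    changed = datetime(2024, 2, 1, tzinfo=timezone.utc)
    print(asyncio.run(needs_download(coll, "https://example.com/", changed)))
```

Because find_one is awaited in the same way, a real AsyncIOMotorCollection could be passed in place of the stub, provided its documents use the assumed field names.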