policy
This module defines the Policy class and its subclasses, which describe the crawling policy of the web crawler.
Classes
Policy
- class dysdera.policy.Policy(focus_policy: typing.Callable[[dysdera.web.WebTarget], typing.Awaitable[bool]] = <function default_true>, sitemap_scheduling_cost: typing.Callable[[typing.Dict[str, str | bool]], int] = <function Policy.<lambda>>, scheduling_cost: typing.Callable[[dysdera.web.WebTarget], int] = <function Policy.<lambda>>, sitemap_selection_policy: typing.Callable[[typing.Dict[str, str | bool]], bool] = <function Policy.<lambda>>, selection_policy: typing.Callable[[dysdera.web.WebTarget], typing.Awaitable[bool]] = <function default_true>, headers_before_visit: typing.Callable[[dysdera.web.WebTarget], typing.Awaitable[bool]] = <function default_false>, respect_robots=True, agent_name=None, canonical_url=True, default_delay: float = 5, can_dload_without_ssl: typing.Callable[[dysdera.web.WebPage], bool] = <function Policy.<lambda>>, visit_sitemap: typing.Callable[[dysdera.parser.URL], bool] = <function Policy.<lambda>>, dload_if_modified_since=<function unknown_last_modify>)
The Policy class defines the behavior of the web crawler. It contains parameters and methods for deciding which pages to visit and crawl. You can implement a custom Policy by defining a subclass of Policy.
- Parameters:
focus_policy (async (dysdera.web.WebTarget) to [bool]) – Coroutine function to determine if the links on a page should be visited. Default is to visit all links.
sitemap_scheduling_cost ((Dict[str, str | bool]) to [int]) – Function to calculate the scheduling cost for pages found in sitemaps. Default returns 1.
scheduling_cost ((dysdera.web.WebTarget) to [int]) – Function to calculate the scheduling cost for individual pages. Default returns 1.
sitemap_selection_policy ((Dict[str, str | bool]) to [bool]) – Function to determine if sitemap pages should be visited. Default is to visit all sitemap pages.
selection_policy (async (dysdera.web.WebTarget) to [bool]) – Coroutine function to determine if a page should be visited. Default is to visit all pages.
headers_before_visit (async (dysdera.web.WebTarget) to [bool]) – Coroutine function to determine if headers should be fetched before visiting a page. Default is not to fetch headers.
respect_robots (bool) – Whether to respect robots.txt rules. Default is True.
agent_name (str or None) – User agent name for robots.txt. Default is None.
default_delay (float) – Default delay to use if not specified in robots.txt. Default is 5.
can_dload_without_ssl ((dysdera.web.WebPage) to [bool]) – Function to determine if a page can be downloaded without SSL. Default returns False.
dload_if_modified_since (callable) – Function used to decide whether a page should be downloaded, based on its last modification date. Default is unknown_last_modify.
Methods:
should_visit: Determines if a page should be visited based on the selection policy.
queue_weight: Calculates the weight of a page in the queue.
map_queue_weight: Calculates the weight of pages in a map queue.
should_crawl: Determines if a page should be crawled based on the focus policy.
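The callback parameters above come in two shapes: coroutine functions returning bool (such as selection_policy) and plain functions returning int (such as scheduling_cost). The following is a minimal sketch of one callback of each shape. FakeTarget and its url attribute are hypothetical stand-ins for dysdera.web.WebTarget, whose real attributes are not described here, and both rules are only illustrations.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeTarget:
    """Hypothetical stand-in for dysdera.web.WebTarget (the `url` attribute is an assumption)."""
    url: str


async def only_https(target) -> bool:
    # Coroutine function: same shape as selection_policy (target -> Awaitable[bool]).
    return target.url.startswith("https://")


def depth_cost(target) -> int:
    # Plain function: same shape as scheduling_cost (target -> int).
    # The cost is 1 plus the number of path separators after the host.
    rest = target.url.split("://", 1)[-1]
    return 1 + rest.rstrip("/").count("/")


if __name__ == "__main__":
    page = FakeTarget("https://example.com/a/b")
    print(asyncio.run(only_https(page)), depth_cost(page))
    # With dysdera installed, the callbacks would presumably be passed by keyword:
    # policy = Policy(selection_policy=only_https, scheduling_cost=depth_cost)
```

The keyword names in the final comment are taken from the class signature; how the crawler interprets a larger or smaller cost is not stated in this reference.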
DomainPolicy
- class dysdera.policy.DomainPolicy(*domains: str, sitemap_scheduling_cost: typing.Callable[[typing.Dict[str, str | bool]], int] = <function DomainPolicy.<lambda>>, scheduling_cost: typing.Callable[[dysdera.web.WebTarget], int] = <function DomainPolicy.<lambda>>)
The DomainPolicy class extends the Policy class to define policies for pages within specified domains.
- Parameters:
domains (str) – One or more domains, passed as positional arguments, to which the policy applies.
sitemap_scheduling_cost ((Dict[str, str | bool]) to [int]) – Function to calculate the scheduling cost for pages found in sitemaps. Default returns 1.
scheduling_cost ((dysdera.web.WebTarget) to [int]) – Function to calculate the scheduling cost for individual pages. Default returns 1.
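To make "pages within specified domains" concrete, here is a rough sketch of a domain check written with the standard library only. It is not DomainPolicy's implementation: the function name in_domains is hypothetical, and treating subdomains as part of a domain is an assumption this reference does not confirm.

```python
from urllib.parse import urlsplit


def in_domains(url: str, *domains: str) -> bool:
    """Return True if the URL's host equals one of `domains` or is a subdomain of it.

    Assumption: subdomains count as in-domain; the library may decide otherwise.
    """
    host = (urlsplit(url).hostname or "").lower()
    return any(
        host == d.lower() or host.endswith("." + d.lower())
        for d in domains
    )


if __name__ == "__main__":
    print(in_domains("https://docs.example.com/x", "example.com"))
    print(in_domains("https://notexample.com/", "example.com"))
```

Matching on the dot-prefixed suffix keeps a host such as notexample.com from being mistaken for example.com.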
ExtendedDomainPolicy
- class dysdera.policy.ExtendedDomainPolicy(*domains: str, sitemap_scheduling_cost: typing.Callable[[typing.Dict[str, str | bool]], int] = <function ExtendedDomainPolicy.<lambda>>, scheduling_cost: typing.Callable[[dysdera.web.WebTarget], int] = <function ExtendedDomainPolicy.<lambda>>)
The ExtendedDomainPolicy class extends the Policy class to define policies for pages within specified domains and the pages they point to.
- Parameters:
domains (str) – One or more domains, passed as positional arguments, to which the policy applies.
sitemap_scheduling_cost ((Dict[str, str | bool]) to [int]) – Function to calculate the scheduling cost for pages found in sitemaps. Default returns 1.
scheduling_cost ((dysdera.web.WebTarget) to [int]) – Function to calculate the scheduling cost for individual pages. Default returns 1.
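One way to read "pages within specified domains and the pages they point to" is as a one-hop extension: a page qualifies if it is in one of the domains, or if it was linked from a page that is. The sketch below illustrates that reading only. The names host_in and extended_should_visit, the linked_from argument, and the exact-host match are all assumptions, not the library's behaviour.

```python
from typing import Optional, Sequence
from urllib.parse import urlsplit


def host_in(url: str, domains: Sequence[str]) -> bool:
    # Exact host match, kept simple for the sketch.
    host = (urlsplit(url).hostname or "").lower()
    return host in {d.lower() for d in domains}


def extended_should_visit(
    url: str, linked_from: Optional[str], domains: Sequence[str]
) -> bool:
    """Visit in-domain pages, plus pages linked directly from an in-domain page."""
    if host_in(url, domains):
        return True
    return linked_from is not None and host_in(linked_from, domains)


if __name__ == "__main__":
    allowed = ("example.com",)
    print(extended_should_visit("https://other.org/", "https://example.com/a", allowed))
    print(extended_should_visit("https://third.net/", "https://other.org/", allowed))
```

Under this reading the crawl reaches external pages one link away from the listed domains but does not continue outward from them.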
MongoMemoryPolicy
- class dysdera.policy.MongoMemoryPolicy(collection: motor.motor_asyncio.AsyncIOMotorCollection, focus_policy: typing.Callable[[dysdera.web.WebTarget], typing.Awaitable[bool]] = <function default_true>, sitemap_scheduling_cost: typing.Callable[[typing.Dict[str, str]], int] = <function MongoMemoryPolicy.<lambda>>, scheduling_cost: typing.Callable[[dysdera.web.WebTarget], int] = <function MongoMemoryPolicy.<lambda>>, sitemap_selection_policy: typing.Callable[[typing.Dict[str, str]], bool] = <function MongoMemoryPolicy.<lambda>>, selection_policy: typing.Callable[[dysdera.web.WebTarget], typing.Awaitable[bool]] = <function default_true>, headers_before_visit: typing.Callable[[dysdera.web.WebTarget], typing.Awaitable[bool]] = <function default_false>, respect_robots=True, agent_name=None, canonical_url=True, default_delay: float = 5, can_dload_without_ssl: typing.Callable[[dysdera.web.WebPage], bool] = <function MongoMemoryPolicy.<lambda>>, visit_sitemap: typing.Callable[[dysdera.parser.URL], bool] = <function MongoMemoryPolicy.<lambda>>, dload_if_modified_since: typing.Callable[[dysdera.parser.URL], datetime.datetime] = <function MongoMemoryPolicy.<lambda>>)
The MongoMemoryPolicy class extends the Policy class to define policies for pages based on data from a MongoDB collection.
- Parameters:
collection (AsyncIOMotorCollection) – MongoDB collection containing information about visited pages.
focus_policy (async (dysdera.web.WebTarget) to [bool]) – Coroutine function to determine if the links on a page should be visited. Default is to visit all links.
sitemap_scheduling_cost ((Dict[str, str]) to [int]) – Function to calculate the scheduling cost for pages found in sitemaps. Default returns 1.
scheduling_cost ((dysdera.web.WebTarget) to [int]) – Function to calculate the scheduling cost for individual pages. Default returns 1.
sitemap_selection_policy ((Dict[str, str]) to [bool]) – Function to determine if sitemap pages should be visited. Default is to visit all sitemap pages.
selection_policy (async (dysdera.web.WebTarget) to [bool]) – Coroutine function to determine if a page should be visited. Default is to visit all pages.
headers_before_visit (async (dysdera.web.WebTarget) to [bool]) – Coroutine function to determine if headers should be fetched before visiting a page. Default is not to fetch headers.
respect_robots (bool) – Whether to respect robots.txt rules. Default is True.
agent_name (str or None) – User agent name for robots.txt. Default is None.
default_delay (float) – Default delay to use if not specified in robots.txt. Default is 5.
can_dload_without_ssl ((dysdera.web.WebPage) to [bool]) – Function to determine if a page can be downloaded without SSL. Default returns False.
dload_if_modified_since ((dysdera.parser.URL) to [datetime.datetime]) – Function used to decide whether a page should be downloaded, based on its last modification date. Default is unknown.
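The name dload_if_modified_since suggests the HTTP If-Modified-Since mechanism, in which a client sends the date of its last copy and the server can answer 304 Not Modified. As a sketch under that assumption, the code below looks up a stored visit date and formats it as an HTTP date. The visited dictionary is a hypothetical in-memory stand-in for the MongoDB collection, and nothing here reflects how MongoMemoryPolicy actually queries it.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Optional

# Hypothetical stand-in for the collection of visited pages: URL -> last visit (UTC).
visited: Dict[str, datetime] = {
    "https://example.com/": datetime(2024, 1, 1, tzinfo=timezone.utc),
}


def last_visit(url: str) -> Optional[datetime]:
    # Returns None when the page has never been visited.
    return visited.get(url)


def if_modified_since_header(url: str) -> Dict[str, str]:
    """Build an If-Modified-Since header for a known page, or no header at all."""
    when = last_visit(url)
    if when is None:
        return {}
    # usegmt=True requires a timezone-aware UTC datetime and emits the "GMT" suffix.
    return {"If-Modified-Since": format_datetime(when, usegmt=True)}


if __name__ == "__main__":
    print(if_modified_since_header("https://example.com/"))
    print(if_modified_since_header("https://example.com/new"))
```

Returning an empty header set for unknown pages means they are always downloaded in full, which is the natural behaviour for a first visit.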