selectionpolicy
This module provides selection policies and functions that calculate the cost of visiting a page.
The functions are grouped into classes for organization. If you want a custom, more intelligent selection policy or scheduling function, you can implement one as a class, but it needs to implement the async method __call__(x: WebTarget) -> bool, as AgedSelectionPolicy does.
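A custom policy along these lines might look like the sketch below. The WebTarget dataclass here is a hypothetical stand-in with a single assumed url attribute (the real dysdera class is not reproduced), and MaxUrlLengthPolicy is an illustrative name, not part of the module.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class WebTarget:
    """Hypothetical stand-in for dysdera's WebTarget; only `url` is assumed."""
    url: str


class MaxUrlLengthPolicy:
    """Illustrative custom policy: accept only targets with a short enough URL."""

    def __init__(self, max_length: int = 100):
        self.max_length = max_length

    async def __call__(self, x: WebTarget) -> bool:
        # True means "visit this page", False means "skip it".
        return len(x.url) <= self.max_length


policy = MaxUrlLengthPolicy(max_length=30)
print(asyncio.run(policy(WebTarget("https://example.com/"))))
```

Because the policy is a class, configuration such as max_length lives in the instance and the crawler only needs to await the object like a function.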
Classes
AgedSelectionPolicy
- class dysdera.selectionpolicy.AgedSelectionPolicy(collection, max_age=20, not_present=False)
This class implements a selection policy based on the age of the page, calculated from a previous database of crawls.
- Parameters:
collection (MongoDB collection) – MongoDB collection containing crawl data.
max_age (int) – Maximum age of the page in days (default is 20).
not_present (bool) – Return value if the page is not present in the database (default is False).
- Methods:
__call__: Asynchronously determines if the page should be visited based on its age.
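The age check can be illustrated with a self-contained sketch. It makes several assumptions that the documentation above does not confirm: a plain dict stands in for the MongoDB collection, the policy receives a URL string rather than a WebTarget, and a page is selected when its last recorded crawl is older than max_age days. InMemoryAgedPolicy is a hypothetical name.

```python
import asyncio
from datetime import datetime, timedelta, timezone


class InMemoryAgedPolicy:
    """Illustrative age-based policy; a dict stands in for the MongoDB collection."""

    def __init__(self, last_visits, max_age=20, not_present=False):
        self.last_visits = last_visits          # hypothetical: url -> datetime of last crawl
        self.max_age = timedelta(days=max_age)  # maximum age, in days
        self.not_present = not_present          # answer for pages never crawled

    async def __call__(self, url: str) -> bool:
        last_visit = self.last_visits.get(url)
        if last_visit is None:
            # Unknown page: fall back to the configured default.
            return self.not_present
        # Assumption: select the page only when the stored crawl is too old.
        return datetime.now(timezone.utc) - last_visit > self.max_age


now = datetime.now(timezone.utc)
policy = InMemoryAgedPolicy(
    {"https://example.com/old": now - timedelta(days=30)},
    max_age=20,
    not_present=True,
)
print(asyncio.run(policy("https://example.com/old")))
```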
SelectionPolicy
- class dysdera.selectionpolicy.SelectionPolicy
Contains static methods to create various selection policies based on different criteria.
- Static Methods:
must_contain: Returns a policy that checks if the URL contains a specific word.
same_domain: Returns a policy that checks if the URL belongs to the same domain as a given URL.
not_true: Returns a policy that negates the given policy.
all_true: Returns a policy that checks if all given policies are true.
at_least_one_true: Returns a policy that checks if at least one of the given policies is true.
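How such policy factories compose can be sketched with plain functions. These are stand-ins written from scratch that reuse the names listed above; the real static methods may take different arguments (for example a WebTarget instead of a URL string, or a list instead of variadic arguments).

```python
import asyncio


def must_contain(word):
    async def policy(url: str) -> bool:
        return word in url
    return policy


def not_true(policy):
    async def negated(x) -> bool:
        return not await policy(x)
    return negated


def all_true(*policies):
    async def combined(x) -> bool:
        # Stop at the first policy that rejects the target.
        for p in policies:
            if not await p(x):
                return False
        return True
    return combined


def at_least_one_true(*policies):
    async def combined(x) -> bool:
        # Stop at the first policy that accepts the target.
        for p in policies:
            if await p(x):
                return True
        return False
    return combined


# Accept URLs that mention "example" but are not login pages.
policy = all_true(must_contain("example"), not_true(must_contain("login")))
print(asyncio.run(policy("https://example.com/news")))
```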
SelectionPolicyWithHeaders
- class dysdera.selectionpolicy.SelectionPolicyWithHeaders
Contains static methods to create selection policies based on header information.
- Static Methods:
modify_only_before: Returns a policy that checks if the last modified date is before a given date.
modify_only_after: Returns a policy that checks if the last modified date is after a given date.
modify_between: Returns a policy that checks if the last modified date is between two given dates.
is_html: Returns a policy that checks if the page is HTML.
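A date-window check on the Last-Modified header could be sketched as follows. The sketch assumes the policy receives the response headers as a plain dict and rejects pages whose header is missing or unparseable. Neither assumption is confirmed by the documentation above, and modified_between is a hypothetical helper, not the library's modify_between.

```python
import asyncio
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def modified_between(start: datetime, end: datetime):
    async def policy(headers: dict) -> bool:
        raw = headers.get("Last-Modified")
        if raw is None:
            return False  # assumption: no header means the page is rejected
        try:
            modified = parsedate_to_datetime(raw)
        except (TypeError, ValueError):
            return False  # unparseable date
        if modified.tzinfo is None:
            # Treat naive timestamps as UTC so the comparison is well-defined.
            modified = modified.replace(tzinfo=timezone.utc)
        return start <= modified <= end
    return policy


policy = modified_between(
    datetime(2015, 1, 1, tzinfo=timezone.utc),
    datetime(2016, 1, 1, tzinfo=timezone.utc),
)
print(asyncio.run(policy({"Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT"})))
```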
SitemapSelectionPolicy
- class dysdera.selectionpolicy.SitemapSelectionPolicy
Contains static methods to create selection policies for sitemap information.
- Static Methods:
not_true: Returns a policy that negates the given policy.
all_true: Returns a policy that checks if all given policies are true.
at_least_one_true: Returns a policy that checks if at least one of the given policies is true.
modify_only_before: Returns a policy that checks if the last modified date in sitemap is before a given date.
modify_only_after: Returns a policy that checks if the last modified date in sitemap is after a given date.
modify_between: Returns a policy that checks if the last modified date in sitemap is between two given dates.
is_news: Returns a policy that checks if the page is a news item.
news_contains: Returns a policy that checks if the news title, name, or keywords contain a specific word.
SchedulingCost
- class dysdera.selectionpolicy.SchedulingCost
Contains methods to calculate the cost of visiting a page based on different criteria.
- Static Methods:
fifo: Returns a policy that assigns a cost of 1 for a breadth-first search.
lifo: Returns a policy that assigns a cost of -1 for a depth-first search.
from_selection_policy: Returns a policy based on a selection policy.
url_contains: Returns a policy that assigns a cost if the URL contains a specific word.
combine: Returns a policy that combines multiple policies.
multiply: Returns a policy that multiplies the costs of two policies.
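The cost functions can be pictured as small factories that return an async callable producing a number. The sketch below reuses the names listed above but is written from scratch. It assumes that combine adds the individual costs, that url_contains returns 0 when the word is absent, and that costs are computed from a URL string. The real methods may behave differently, and how the crawler orders pages by cost is not covered here.

```python
import asyncio


def fifo():
    async def cost(url: str) -> int:
        return 1   # constant cost of 1, as listed for breadth-first search
    return cost


def lifo():
    async def cost(url: str) -> int:
        return -1  # constant cost of -1, as listed for depth-first search
    return cost


def url_contains(word, cost_if_present=-5):
    async def cost(url: str) -> int:
        # Assumption: URLs without the word get a neutral cost of 0.
        return cost_if_present if word in url else 0
    return cost


def combine(*costs):
    async def combined(url: str) -> int:
        # Assumption: "combining" means summing the individual costs.
        return sum([await c(url) for c in costs])
    return combined


def multiply(first, second):
    async def product(url: str) -> int:
        return await first(url) * await second(url)
    return product


cost = combine(fifo(), url_contains("news", cost_if_present=-5))
print(asyncio.run(cost("https://example.com/news")))
```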
SchedulingCostWithHeader
SitemapSchedulingCost
- class dysdera.selectionpolicy.SitemapSchedulingCost
Contains methods to calculate the cost of visiting a page based on sitemap information.
- Static Methods:
combine: Returns a policy that combines multiple policies.
from_selection_policy: Returns a policy based on a selection policy.
latest_modify: Returns a policy that assigns a cost based on the latest modified date in sitemap.
Note: Ensure that you have appropriate imports and configurations for the classes and methods mentioned in this documentation.