extractors
This module contains the classes that implement the extraction logic.
Classes
Dysdera Extractor
- class dysdera.extractors.DysderaExtractor[source]
Abstract class for the information extractor.
If you want to create a custom extractor, your extractor must be a subclass of DysderaExtractor and implement the method extract().
- abstract async extract(x: WebTarget)[source]
- Parameters:
x (dysdera.web.WebTarget) – Web target to be extracted.
This method should be implemented by subclasses to define the logic for extracting information from a web target.
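The subclassing requirement can be sketched as follows. To keep the sketch runnable without the library installed, `ExtractorBase` is a local stand-in that mirrors the documented interface, and `FakeTarget` is a hypothetical substitute for `dysdera.web.WebTarget`; its `url` attribute is an assumption, not something this page documents. In real code you would subclass `dysdera.extractors.DysderaExtractor` directly.

```python
import asyncio
from abc import ABC, abstractmethod


class ExtractorBase(ABC):
    """Local stand-in mirroring the documented DysderaExtractor interface.

    In real code, subclass dysdera.extractors.DysderaExtractor instead.
    """

    @abstractmethod
    async def extract(self, x):
        ...


class FakeTarget:
    """Hypothetical stand-in for dysdera.web.WebTarget; `url` is an assumed attribute."""

    def __init__(self, url):
        self.url = url


class UrlCollector(ExtractorBase):
    """Example extractor that only records the URL of each target it receives."""

    def __init__(self):
        self.urls = []

    async def extract(self, x):
        # extract() is a coroutine, matching the documented abstract async method.
        self.urls.append(x.url)


collector = UrlCollector()
asyncio.run(collector.extract(FakeTarget("https://example.com/")))
```

Because `extract()` is abstract, a subclass that does not define it cannot be instantiated, which is how the requirement is enforced.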
- static page_to_dict(x: WebTarget) → dict[source]
- Parameters:
x (dysdera.web.WebTarget) – Web target to be transformed into a dictionary.
- Returns:
Dictionary containing information extracted from the web target.
- Return type:
dict
Transforms a WebTarget into a dictionary and returns it. The dictionary contains the following keys:
url: URL of the web target.
domain: Domain of the URL.
name: Page title extracted from the web target.
titles: Concatenated titles extracted from the web target.
text: Concatenated text content extracted from the web target.
figcapt: Concatenated figure captions extracted from the web target.
links: List of URLs extracted from the web target.
canonical_url: Canonical URL extracted from the web target.
meta: Metadata extracted from the web target.
visited: Date and time when the web target was visited.
lastmod: Date and time of the last modification of the web target.
timestamp_UTC: UTC timestamp of the last modification of the web target, if available.
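When consuming these dictionaries downstream, it can help to check a record against the documented key set. The following is a minimal sketch, not part of the library: `DOCUMENTED_KEYS` and `missing_keys` are hypothetical helper names, and the sample records are hand-written placeholders rather than real crawl output.

```python
# Keys listed in the page_to_dict() documentation above.
DOCUMENTED_KEYS = (
    "url", "domain", "name", "titles", "text", "figcapt", "links",
    "canonical_url", "meta", "visited", "lastmod", "timestamp_UTC",
)


def missing_keys(record: dict) -> list:
    """Return the documented keys that are absent from `record`, in listed order."""
    return [key for key in DOCUMENTED_KEYS if key not in record]


# Hand-written illustrative records; every value is an invented placeholder.
sample = {key: None for key in DOCUMENTED_KEYS}
sample.update(url="https://example.com/a", domain="example.com", links=[])

incomplete = {"url": "https://example.com/b"}
```

Calling `missing_keys(incomplete)` lists every documented key except `url`, which is a quick way to spot records that did not come from `page_to_dict()`.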
Mongo Extractor
- class dysdera.extractors.MongoExtractor(collection: motor.motor_asyncio.AsyncIOMotorCollection, save_if: typing.Callable[[dict], bool] = <function MongoExtractor.<lambda>>)[source]
Extractor that saves crawl information in a MongoDB collection as a dict, see DysderaExtractor.page_to_dict().
- Parameters:
collection (motor.motor_asyncio.AsyncIOMotorCollection) – AsyncIOMotorCollection to save crawl information.
save_if (Callable[[dict], bool]) – Function to filter pages to be saved (default is to save all).
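A `save_if` filter is just a function from the page dictionary to a bool. The sketch below builds one that keeps only pages from a single domain with non-empty text. `only_domain` is a hypothetical helper name, and the sketch assumes that the `domain` value is a bare host string such as `example.com` and that `text` is a string, neither of which this page states explicitly.

```python
from typing import Callable


def only_domain(domain: str) -> Callable[[dict], bool]:
    """Build a save_if predicate keeping pages from `domain` that have non-empty text."""

    def predicate(page: dict) -> bool:
        # Keys follow the page_to_dict() documentation; .get() tolerates missing keys.
        return page.get("domain") == domain and bool(page.get("text"))

    return predicate


save_if = only_domain("example.com")
# Would be passed to the extractor as, for example:
#   MongoExtractor(collection, save_if=save_if)
```

The same predicate works with any extractor that accepts a `save_if` argument of this type, since it depends only on the dictionary keys.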
Json Extractor
- class dysdera.extractors.JsonExtractor(file, save_if: typing.Callable[[dict], bool] = <function JsonExtractor.<lambda>>)[source]
Extractor that saves crawl information in a JSON file as a dict, see DysderaExtractor.page_to_dict().
- Parameters:
file (file-like object) – File to save crawl information.
save_if (Callable[[dict], bool]) – Function to filter pages to be saved (default is to save all).