extractors

This module contains the classes that implement the extraction logic.

Classes

Dysdera Extractor

class dysdera.extractors.DysderaExtractor

Abstract class for the information extractor.

To create a custom extractor, subclass DysderaExtractor and implement the extract() method.

abstract async extract(x: WebTarget)

Parameters: x (dysdera.web.WebTarget) – Web target to be extracted.

This method should be implemented by subclasses to define the logic for extracting information from a web target.
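A minimal sketch of the subclassing pattern described above. So that it runs without the library installed, it defines a stand-in DysderaExtractor base class; in real code the class would be imported from dysdera.extractors. The UrlCollector name, the `url` attribute on the target, and the SimpleNamespace stand-in for a WebTarget are assumptions made for illustration.

```python
import asyncio
from abc import ABC, abstractmethod
from types import SimpleNamespace


class DysderaExtractor(ABC):
    """Stand-in for dysdera.extractors.DysderaExtractor (assumption).

    In real code, import the class from dysdera.extractors instead.
    """

    @abstractmethod
    async def extract(self, x):
        ...


class UrlCollector(DysderaExtractor):
    """Hypothetical extractor that records the URL of every target it sees."""

    def __init__(self):
        self.urls = []

    async def extract(self, x):
        # Assumption: the target exposes a `url` attribute.
        self.urls.append(x.url)


async def main():
    collector = UrlCollector()
    # SimpleNamespace stands in for a dysdera.web.WebTarget here.
    await collector.extract(SimpleNamespace(url="https://example.com/"))
    return collector.urls


collected = asyncio.run(main())
```

Because extract() is declared abstract, a subclass that does not override it cannot be instantiated.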

static page_to_dict(x: WebTarget) → dict

Parameters: x (dysdera.web.WebTarget) – Web target to be transformed into a dictionary.
Returns: Dictionary containing information extracted from the web target.
Return type: dict

Transforms a WebTarget into a dictionary and returns it. The dictionary contains the following keys:

url: URL of the web target.
domain: Domain of the URL.
name: Page title extracted from the web target.
titles: Concatenated titles extracted from the web target.
text: Concatenated text content extracted from the web target.
figcapt: Concatenated figure captions extracted from the web target.
links: List of URLs extracted from the web target.
canonical_url: Canonical URL extracted from the web target.
meta: Metadata extracted from the web target.
visited: Date and time when the web target was visited.
lastmod: Date and time of the last modification of the web target.
timestamp_UTC: UTC timestamp of the last modification of the web target, if available.
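Code that consumes these dictionaries may want to check that a record carries every documented key. The sketch below is one possible way to do that; DOCUMENTED_KEYS and missing_keys are hypothetical helpers, not part of dysdera, and the value types of the entries are not assumed.

```python
# Key names copied from the list above.
DOCUMENTED_KEYS = frozenset({
    "url", "domain", "name", "titles", "text", "figcapt", "links",
    "canonical_url", "meta", "visited", "lastmod", "timestamp_UTC",
})


def missing_keys(record: dict) -> set:
    """Return the documented keys that are absent from `record`."""
    return set(DOCUMENTED_KEYS.difference(record))
```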

Mongo Extractor

class dysdera.extractors.MongoExtractor(collection: motor.motor_asyncio.AsyncIOMotorCollection, save_if: typing.Callable[[dict], bool] = <function MongoExtractor.<lambda>>)

Extractor that saves crawl information to a MongoDB collection as a dict (see DysderaExtractor.page_to_dict()).

Parameters:

collection (motor.motor_asyncio.AsyncIOMotorCollection) – Collection in which crawl information is saved.
save_if (Callable[[dict], bool]) – Predicate that receives the page dict and returns True if the page should be saved (default: save all pages).
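A sketch of a save_if predicate. It relies only on the `text` key documented for DysderaExtractor.page_to_dict(); the has_text name is hypothetical, and the constructor call is left as a comment because it needs a live AsyncIOMotorCollection.

```python
def has_text(page: dict) -> bool:
    """Keep only pages whose `text` entry is present and non-empty."""
    return bool(page.get("text"))


# With motor and dysdera installed, the predicate would be passed as:
#   MongoExtractor(collection, save_if=has_text)
# where `collection` is an AsyncIOMotorCollection.
```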

Json Extractor

class dysdera.extractors.JsonExtractor(file, save_if: typing.Callable[[dict], bool] = <function JsonExtractor.<lambda>>)

Extractor that saves crawl information to a JSON file as a dict (see DysderaExtractor.page_to_dict()).

Parameters:

file (file-like object) – File to which crawl information is written.
save_if (Callable[[dict], bool]) – Predicate that receives the page dict and returns True if the page should be saved (default: save all pages).
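Since save_if accepts a single callable, several conditions have to be combined into one function. The sketch below shows one way to do that; all_of, from_domain, and has_links are hypothetical helpers, and the exact format of the `domain` value (assumed here to be a bare host name) is not specified by the documentation.

```python
from typing import Callable

Predicate = Callable[[dict], bool]


def all_of(*predicates: Predicate) -> Predicate:
    """Build a predicate that is true only if every given predicate is true."""
    def combined(page: dict) -> bool:
        return all(p(page) for p in predicates)
    return combined


def from_domain(domain: str) -> Predicate:
    """Build a predicate that matches pages whose `domain` equals `domain`."""
    return lambda page: page.get("domain") == domain


def has_links(page: dict) -> bool:
    """True if the page dict carries at least one extracted link."""
    return bool(page.get("links"))


# Candidate value for the save_if parameter.
keep = all_of(from_domain("example.com"), has_links)
```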

File Extractor

class dysdera.extractors.FileExtractor(*extension: str, output_dir: str = '')

Extractor that saves all files with one of the required extensions.

Parameters:

extension (str) – One or more file extensions to save.
output_dir (str) – Directory in which files are saved (default is the current directory).
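Because extension is a variadic positional parameter, every positional argument is treated as an extension, so output_dir has to be passed by keyword. The sketch below illustrates this with a stand-in function that copies only the parameter list of FileExtractor; it is not the library's implementation, and whether extensions are written with a leading dot is not stated in the documentation.

```python
def file_extractor_args(*extension: str, output_dir: str = ""):
    """Stand-in with the same parameter list as FileExtractor (illustration only)."""
    return extension, output_dir


# Intended call: two extensions, directory passed by keyword.
exts, out = file_extractor_args("pdf", "png", output_dir="downloads")

# Pitfall: a positional "downloads" is collected as another extension.
wrong_exts, wrong_out = file_extractor_args("pdf", "downloads")
```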