extractors

This module contains the classes that implement the extraction logic.

Classes

Dysdera Extractor

class dysdera.extractors.DysderaExtractor[source]

Abstract class for the information extractor.

To create a custom extractor, subclass DysderaExtractor and implement the method extract().

abstract async extract(x: WebTarget)[source]
Parameters:

x (dysdera.web.WebTarget) – Web target to be extracted.

This method should be implemented by subclasses to define the logic for extracting information from a web target.
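The subclassing pattern described above can be sketched without installing dysdera. In this minimal sketch, ExtractorBase and FakeTarget are hypothetical stand-ins for DysderaExtractor and dysdera.web.WebTarget, and the url attribute on the target is an assumption made for illustration only.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class FakeTarget:
    """Hypothetical stand-in for dysdera.web.WebTarget (assumes a url attribute)."""
    url: str


class ExtractorBase(ABC):
    """Hypothetical stand-in for DysderaExtractor: one abstract async extract()."""

    @abstractmethod
    async def extract(self, x):
        ...


class UrlCollector(ExtractorBase):
    """Custom extractor that records the URL of every target it receives."""

    def __init__(self):
        self.seen = []

    async def extract(self, x):
        # The extraction logic lives here; this one only keeps the URL.
        self.seen.append(x.url)


async def main():
    collector = UrlCollector()
    await collector.extract(FakeTarget("https://example.com/a"))
    await collector.extract(FakeTarget("https://example.com/b"))
    return collector.seen


seen = asyncio.run(main())
print(seen)
```

With the real library, the same class body would presumably inherit from dysdera.extractors.DysderaExtractor instead of the stand-in base class.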

static page_to_dict(x: WebTarget) dict[source]
Parameters:

x (dysdera.web.WebTarget) – Web target to be transformed into a dictionary.

Returns:

Dictionary containing information extracted from the web target.

Return type:

dict

Transforms a WebTarget into a dictionary and returns it. The dictionary contains the following keys:

  • url: URL of the web target.

  • domain: Domain of the URL.

  • name: Page title extracted from the web target.

  • titles: Concatenated titles extracted from the web target.

  • text: Concatenated text content extracted from the web target.

  • figcapt: Concatenated figure captions extracted from the web target.

  • links: List of URLs extracted from the web target.

  • canonical_url: Canonical URL extracted from the web target.

  • meta: Metadata extracted from the web target.

  • visited: Date and time when the web target was visited.

  • lastmod: Date and time of the last modification of the web target.

  • timestamp_UTC: UTC timestamp of the last modification of the web target, if available.
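The key names listed above can be collected in one place, for example to check a stored record before relying on it. This is a sketch only: PAGE_KEYS and missing_keys are hypothetical helpers, not part of dysdera, and the partial record is made-up sample data.

```python
# Key names taken from the page_to_dict() description above.
PAGE_KEYS = (
    "url", "domain", "name", "titles", "text", "figcapt",
    "links", "canonical_url", "meta", "visited", "lastmod",
    "timestamp_UTC",
)


def missing_keys(page: dict) -> list:
    """Return the documented keys that are absent from page, in listed order."""
    return [key for key in PAGE_KEYS if key not in page]


# Made-up partial record, used only to exercise the helper.
partial = {"url": "https://example.com/", "domain": "example.com", "links": []}
print(missing_keys(partial))
```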

Mongo Extractor

class dysdera.extractors.MongoExtractor(collection: motor.motor_asyncio.AsyncIOMotorCollection, save_if: ~typing.Callable[[dict], bool] = <function MongoExtractor.<lambda>>)[source]

Extractor that saves crawl information to a MongoDB collection as a dict (see DysderaExtractor.page_to_dict()).

Parameters:
  • collection (motor.motor_asyncio.AsyncIOMotorCollection) – AsyncIOMotorCollection to save crawl information.

  • save_if (Callable[[dict], bool]) – Predicate that receives the page dict and returns True if the page should be saved (default is to save all).
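A save_if predicate is an ordinary function from the page dict to a bool. The sketch below shows two possible predicates. The names has_enough_text and only_domain are hypothetical, and it assumes the text and domain values are plain strings, which the key descriptions suggest but do not state.

```python
from typing import Callable


def has_enough_text(page: dict, min_chars: int = 200) -> bool:
    """Keep only pages whose text is at least min_chars characters long."""
    text = page.get("text") or ""
    return len(text) >= min_chars


def only_domain(domain: str) -> Callable[[dict], bool]:
    """Build a predicate that keeps pages from a single domain."""
    def predicate(page: dict) -> bool:
        return page.get("domain") == domain
    return predicate


# Made-up page dicts, used only to exercise the predicates.
short_page = {"domain": "example.com", "text": "too short"}
long_page = {"domain": "example.org", "text": "x" * 500}

print(has_enough_text(short_page), has_enough_text(long_page))
print(only_domain("example.com")(short_page))
```

Either function could then be passed as the save_if argument, for example MongoExtractor(collection, save_if=has_enough_text).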

Json Extractor

class dysdera.extractors.JsonExtractor(file, save_if: ~typing.Callable[[dict], bool] = <function JsonExtractor.<lambda>>)[source]

Extractor that saves crawl information to a JSON file as a dict (see DysderaExtractor.page_to_dict()).

Parameters:
  • file (file-like object) – File to save crawl information.

  • save_if (Callable[[dict], bool]) – Function to filter pages to be saved (default is to save all).

File Extractor

class dysdera.extractors.FileExtractor(*extension: str, output_dir: str = '')[source]

Extractor that saves every file whose extension is one of the required extensions.

Parameters:
  • extension (str) – Required file extensions.

  • output_dir (str) – Output directory to save files (default is current directory).
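Matching a URL against a set of required extensions might look like the sketch below. It is not FileExtractor's actual implementation: has_extension is a hypothetical helper, and ignoring case and the leading dot are assumptions the documentation does not confirm.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def has_extension(url: str, *extension: str) -> bool:
    """Return True if the URL path ends with one of the given extensions."""
    # Use only the path component, so query strings and fragments are ignored.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower().lstrip(".")
    # Assumption: extensions are compared case-insensitively, with or without a dot.
    wanted = {ext.lower().lstrip(".") for ext in extension}
    return suffix in wanted


print(has_extension("https://example.com/a/report.PDF?dl=1", "pdf"))
print(has_extension("https://example.com/index.html", "pdf", "png"))
```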