dysderacrawler

This module contains the DysderaCrawler class, which defines the logic of the web crawler. Its methods are asynchronous, so they must be awaited inside an async function (see Example Usage below).

Classes

DysderaCrawler

class dysdera.dysderacrawler.DysderaCrawler(verbose=False, verbose_log=False, max_timeout=10, duplicate_sensibility: int = 0)[source]

The DysderaCrawler class defines the behavior of the web crawler.

Parameters:
  • verbose (bool) – If True, the crawler will display its actions on the command line during crawling. (Default: False)

  • verbose_log (bool) – If True, information about the crawler’s actions will also be written to the log file. (Default: False)

  • max_timeout (int) – Maximum timeout used when downloading a page. (Default: 10)

  • duplicate_sensibility (int) – Determines how the crawler handles duplicate pages. If 1, the crawler will ignore pages with the same hash. If greater than 1, it will ignore pages with the same simhash. (Default: 0)
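
For instance, a crawler that prints its progress and skips pages whose content hash matches an already-seen page could be created as follows (the parameter values here are only illustrative):

from dysdera.dysderacrawler import DysderaCrawler

# illustrative settings: verbose output, a shorter download timeout,
# and exact-duplicate detection via content hash
crawler = DysderaCrawler(verbose=True, max_timeout=20, duplicate_sensibility=1)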

Methods:

async start(session, policy: Policy, extractor: DysderaExtractor, *domains: str)[source]
Parameters:
  • session (aiohttp.ClientSession) – The aiohttp session for the crawl.

  • policy (dysdera.Policy) – The policy the crawler should adopt for this crawl.

  • extractor (dysdera.DysderaExtractor) – The type of extractor to use for this crawl, depending on how you want to save the web pages.

  • domains (str) – Entry points for the crawl, each given as a string containing a valid domain URL.

This method is called to start the crawl.

async terminate()[source]

This method is called to terminate the crawl.
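
Since start() runs until the crawl is finished, terminate() is typically called from another task, for example to enforce a time limit. The sketch below assumes that awaiting terminate() lets a pending start() call wind down; the crawl_for helper and the 60-second limit are only illustrative:

import asyncio

async def crawl_for(crawler, session, policy, extractor, *domains, seconds=60):
    # run the crawl in the background and stop it after the time limit
    crawl_task = asyncio.create_task(
        crawler.start(session, policy, extractor, *domains))
    await asyncio.sleep(seconds)
    await crawler.terminate()  # assumption: this makes the running start() return
    await crawl_task           # wait for the crawl to shut down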

Example Usage

Here’s an example of how to use DysderaCrawler:

import asyncio
import aiohttp
from motor.motor_asyncio import AsyncIOMotorClient
from dysdera.dysderacrawler import DysderaCrawler
from dysdera.extractors import MongoExtractor
from dysdera.policy import Policy

async def main(collection):
    # verbose crawler with a longer download timeout
    crawler = DysderaCrawler(verbose=True, max_timeout=50)
    async with aiohttp.ClientSession() as session:
        await crawler.start(session, Policy(),
                            MongoExtractor(collection),  # store pages in the MongoDB collection
                            'https://www.example.com/',
                            # more starting domains if you want
                            )

if __name__ == "__main__":
    mongo = AsyncIOMotorClient("mongodb://localhost:27017")
    try:
        # run the crawl on a fresh event loop, saving pages to the dysderadb.film collection
        loop = asyncio.new_event_loop()
        loop.run_until_complete(main(mongo.dysderadb.film))
    finally:
        mongo.close()