web

This module contains the WebTarget class.

Classes

Web Target

class dysdera.web.WebTarget(session: aiohttp.ClientSession, url: URL, timeout: int, refer: URL | None = None, if_modified_since: datetime | None = None)

Class for every page, HTML or not, encountered during crawling.

It represents a web page, which may or may not be HTML, and provides methods for extracting information from the page, such as links, titles, text, and metadata. Currently, link extraction is supported only on HTML pages.

Methods:

lxml_is_html: Checks if the page is HTML based on its content.
extract_links: Extracts links from the page.
canonical_url: Gets the canonical URL of the page.
extract_titles: Extracts titles from the page.
extract_page_title: Extracts the page title.
extract_text: Extracts text from the page.
extract_figcaptions: Extracts figcaptions from the page.
extract_metadata: Extracts metadata from the page.
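
To give a rough idea of what link, page-title, and canonical-URL extraction involve on an HTML page, here is a minimal sketch built only on the Python standard library. It is not WebTarget's implementation (the method name lxml_is_html suggests the real class relies on lxml); the names PageInfoParser and extract_page_info are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageInfoParser(HTMLParser):
    """Hypothetical helper: collects <a href> links, the <title> text,
    and the <link rel="canonical"> target from an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.canonical = None
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        href = attr.get("href")
        if tag == "a" and href:
            # Resolve relative links against the URL of the page itself.
            self.links.append(urljoin(self.base_url, href))
        elif tag == "link" and href and (attr.get("rel") or "").lower() == "canonical":
            self.canonical = urljoin(self.base_url, href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title>...</title> is kept.
        if self._in_title:
            self.title += data


def extract_page_info(html, base_url):
    """Return the links, title, and canonical URL found in `html`."""
    parser = PageInfoParser(base_url)
    parser.feed(html)
    return {
        "links": parser.links,
        "title": parser.title.strip(),
        "canonical": parser.canonical,
    }


sample = (
    '<html><head><title>Demo</title>'
    '<link rel="canonical" href="/home"></head>'
    '<body><a href="a.html">A</a>'
    '<a href="https://other.example/b">B</a><a>no href</a></body></html>'
)
info = extract_page_info(sample, "https://example.com/dir/page.html")
```

Anchors without an href are skipped, and relative links are resolved against the page URL, which is what a crawler typically needs before queueing the next targets.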

property request_header: dict

Returns the headers for the HTTP request.
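
The documentation does not say which headers the property produces. Judging only by the constructor parameters refer and if_modified_since, one plausible mapping is onto the standard Referer and If-Modified-Since HTTP headers. The sketch below illustrates that assumption with a hypothetical stand-alone function, build_request_headers; it is not the library's code, and treating naive datetimes as UTC is likewise an assumption.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Optional


def build_request_headers(refer: Optional[str] = None,
                          if_modified_since: Optional[datetime] = None) -> dict:
    """Hypothetical helper: build conditional-request headers from the
    same kind of values that WebTarget's constructor accepts."""
    headers = {}
    if refer is not None:
        # The HTTP header name is spelled "Referer" (historic misspelling).
        headers["Referer"] = str(refer)
    if if_modified_since is not None:
        # Assumption: a naive datetime is interpreted as UTC.
        if if_modified_since.tzinfo is None:
            if_modified_since = if_modified_since.replace(tzinfo=timezone.utc)
        # HTTP dates are always expressed in GMT.
        headers["If-Modified-Since"] = format_datetime(
            if_modified_since.astimezone(timezone.utc), usegmt=True
        )
    return headers


headers = build_request_headers(
    refer="https://example.com/",
    if_modified_since=datetime(2015, 10, 21, 7, 28, 0, tzinfo=timezone.utc),
)
```

With an If-Modified-Since header set, a server can answer 304 Not Modified instead of resending an unchanged page, which saves bandwidth when recrawling.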